After deploying an ML model, there are typically two ways to serve the incoming requests:
Real-time inference
Batch inference
Real-time inference, as the name suggests, processes incoming requests and generates predictions immediately. ChatGPT is one example.
Batch inference, however, stores the incoming requests and generates predictions later, in large chunks at scheduled intervals (daily, weekly, monthly, etc.).
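To make the difference concrete, here is a minimal sketch in Python. The predict function and the BatchQueue class are hypothetical stand-ins written for illustration, not part of any real serving library.

```python
from typing import Callable, List


def predict(texts: List[str]) -> List[str]:
    """Stand-in for a real model: returns each input in upper case."""
    return [t.upper() for t in texts]


class BatchQueue:
    """Stores incoming requests and runs the model only when flushed."""

    def __init__(self, model: Callable[[List[str]], List[str]]) -> None:
        self.model = model
        self.pending: List[str] = []

    def submit(self, text: str) -> int:
        """Queue a request and return its position in the current batch."""
        self.pending.append(text)
        return len(self.pending) - 1

    def flush(self) -> List[str]:
        """Run one model call over everything queued, then clear the queue."""
        results = self.model(self.pending) if self.pending else []
        self.pending = []
        return results


# Real-time: one model call per request, answered immediately.
realtime_result = predict(["first review"])[0]

# Batch: requests accumulate; a scheduler (say, a nightly job) calls flush().
queue = BatchQueue(predict)
queue.submit("second review")
queue.submit("third review")
batch_results = queue.flush()
```

In the batch case, the model is only needed while flush() runs, which is the property the rest of this post builds on.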
Today, I want to tell you about MyMagic.AI, a tool specifically focused on serving batch inference use cases with several open-source LLMs.
More specifically, in this post, I will:
Discuss the utility, pros and cons of batch inference.
Do a practical demo of MyMagic.AI using their APIs.
Share my opinion about MyMagic.AI.
According to benchmarks, it is the world’s cheapest batch inference service, so it will be fun to learn how to use it.
Let’s begin!
Batch inference
As discussed above, batch inference is suitable in use cases where predictions do not need to be served immediately.
For instance:
Reviews on Amazon typically have an overall product summary. It is not necessary to update the summary with every new review. Instead, the summary can be updated periodically, say, once a week, for many products together.
Financial institutions can run batch processes to evaluate the credit risk of loan applicants. Instead of assessing risk for each applicant in real time, they can process applications in batches at the end of the day to update their risk profiles.
and more.
The benefits?
There are many.
In real-time inference, a new request can literally come at any time of the day.
Thus, the model must be available in production 24/7 to serve those predictions in real time, which means increased costs.
For instance, based on AWS EC2 on-demand pricing, it would cost ~$170/day to keep a LLaMA 2 model in production. That’s roughly $62,000/year.
But these overheads disappear in batch inference as predictions are generated at scheduled intervals.
This significantly reduces the inference costs (both for the API provider and the user) as the model is only loaded at scheduled intervals.
If there were no requests throughout the day, there’s no need to load the model.
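As a rough back-of-the-envelope check, the sketch below annualizes the ~$170/day figure and compares it with a batch setup. The 2 hours/day of batch runtime is a made-up number, and the calculation assumes cost scales linearly with the hours the model is loaded, ignoring startup time and storage.

```python
DAILY_COST_ALWAYS_ON = 170.0  # USD/day, the figure quoted above
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def annual_cost(hours_loaded_per_day: float) -> float:
    """Yearly cost if the model is loaded for the given hours each day.

    Assumes cost is proportional to the hours the instance is running.
    """
    fraction_of_day = hours_loaded_per_day / HOURS_PER_DAY
    return DAILY_COST_ALWAYS_ON * fraction_of_day * DAYS_PER_YEAR


always_on = annual_cost(24)  # model available around the clock
batch = annual_cost(2)       # hypothetical: one 2-hour batch job per day

print(f"Always-on: ${always_on:,.0f}/year")
print(f"Batch (2h/day, assumed): ${batch:,.0f}/year")
```

Under these assumptions, the yearly bill shrinks in proportion to how rarely the model has to be loaded.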
Other than that, errors and failures in batch inference can be managed more effectively. If an error occurs, the batch job can be re-run post-rectification. Real-time inference services, however, can be severely impacted due to failures.
A major challenge with batch inference is that predictions are not readily available. So the utility heavily leans towards use cases where a real-time response is not required.
MyMagic.AI demo
The demo below can be done with a free MyMagic account, which you can create here.
After logging in, click on the “Subscribe” button to subscribe to the models currently available. Also, please take note of the API Key.
From here on, the workflow is quite simple and intuitive, as one might expect.
The data we want to use as input for batch inference typically exists in cloud storage, like AWS S3, Azure Blob storage, Snowflake, etc.
We enable read-and-write access to this data source to let MyMagic interact with it:
We invoke the MyMagic API. Being a batch inference call, the request isn’t processed immediately. But it returns a unique identifier (task_id), which we can use to check the status of the request later.
Finally, once some time has elapsed, results are written to the data source.
Done!
Let’s do a demo of this using the MyMagic API and AWS S3 as the data source.
To save time, I have already set up an S3 Bucket: “demo-mymagic-bucket.” I have put a text file in this location: <bucket_name>/<API-Key>/Demo-MyMagic.
Before proceeding, make sure you have created an S3 bucket. Also, as AWS bucket names are globally unique, you must create your bucket under a different name.
You can place this text file in the above location: Dummy text file.
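If you prefer to upload the file from code, the sketch below shows one way to do it with boto3, the AWS SDK for Python. It assumes boto3 is installed and your AWS credentials are configured. The helper names are hypothetical, and the object key simply mirrors the <API-Key>/Demo-MyMagic layout described above.

```python
import os


def build_object_key(api_key: str, session: str, filename: str) -> str:
    """Build the S3 object key: <API-Key>/<session>/<file name>."""
    return f"{api_key}/{session}/{os.path.basename(filename)}"


def upload_input_file(bucket: str, api_key: str, session: str, local_path: str) -> str:
    """Upload a local file to the bucket and return the key it was stored under."""
    import boto3  # third-party AWS SDK, imported lazily so the key helper works without it

    key = build_object_key(api_key, session, local_path)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key


example_key = build_object_key("<API-Key>", "Demo-MyMagic", "dummy.txt")
```

Calling upload_input_file with your bucket name, API key, "Demo-MyMagic", and the path to the text file would then place the file in the expected folder.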
Next, to ensure secure and controlled access while adhering to AWS best practices, there are a couple of things we need to do before invoking the API.
Step 1: Create an IAM Policy for Bucket Access
Go to the AWS management console and search for “IAM.” Next, go to Policies and select Create policy.
Next, use the JSON editor to create a policy.
Paste the following in the JSON editor (replace “your-bucket-name”):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket-name",
                "arn:aws:s3:::your-bucket-name/*"
            ]
        }
    ]
}
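To avoid hand-editing the bucket name in two places, the same policy can be generated from a short Python helper. This is only a convenience sketch: build_bucket_policy is a hypothetical name, and the output is the policy above with your bucket name substituted.

```python
import json


def build_bucket_policy(bucket_name: str) -> dict:
    """Return the bucket-access policy shown above for the given bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                ],
                # The bucket ARN covers ListBucket; the /* ARN covers the objects.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


# Replace the bucket name with your own before pasting the output.
policy_json = json.dumps(build_bucket_policy("your-bucket-name"), indent=4)
print(policy_json)
```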
Name the policy and then create it.
Step 2: Create an IAM Role for MyMagic API
Go to the AWS management console and search for “IAM” again. Next, select Roles and click on Create role.
In the next window, do the following:
The MyMagic AWS Account ID is 537808082884.
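The console generates the role’s trust policy for you. Assuming the role is set up as a cross-account role that trusts the MyMagic account (this is an assumption, as the exact console selections are not reproduced here), the resulting trust policy would typically look like the output of this sketch. The build_trust_policy helper is hypothetical.

```python
import json

MYMAGIC_ACCOUNT_ID = "537808082884"  # account ID given above


def build_trust_policy(account_id: str) -> dict:
    """Standard cross-account trust policy letting another account assume the role.

    Assumption: MyMagic accesses the bucket by assuming this role via STS.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy_json = json.dumps(build_trust_policy(MYMAGIC_ACCOUNT_ID), indent=4)
print(trust_policy_json)
```

Comparing this with the trust relationship shown on the role’s page after creation is a quick way to confirm the role trusts the right account.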
Click Next, select the policy created in Step 1, and again click Next.
Name the role and then create it.
Take note of the role ARN provided by AWS, as it will be needed in the next step.
With that, we are done with the setup.
Step 3: Invoke MyMagic API
Now, we can invoke the MyMagic API for batch inference on the above data as follows (replace the placeholders in angle brackets with your own values):
curl -X POST https://fastapi.mymagic.ai/v1/completions \
  -H 'Authorization: Bearer <Your-MyMagic-API-Key>' \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama2_70b",
       "question": "Summarise the file!",
       "storage_provider": "s3",
       "bucket_name": "<your-bucket-name>",
       "session": "Demo-MyMagic",
       "role_arn": "<Your-ARN-here>",
       "region": "<your-bucket-region>"}'
This returns the following JSON response with a task_id, which we can use to track the status of the request:
We can track the status using the following GET request:
curl --request GET --url https://fastapi.mymagic.ai/get_result/<task-id>
It will show “Pending” for a while, but a successful response will look like the following:
Returning to the S3 Bucket, the response file has been uploaded by MyMagic API.
Done!
Wasn’t that simple?
You can use the same procedure for any other batch inference use case.
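The submit-and-poll steps above can also be scripted. Below is a minimal Python sketch using only the standard library. The two URLs and the payload fields come from the curl commands above; the helper names, the polling interval, and the is_pending check are my own assumptions, since the exact shape of the status response is not shown here.

```python
import json
import time
import urllib.request
from typing import Callable, Union

COMPLETIONS_URL = "https://fastapi.mymagic.ai/v1/completions"
RESULT_URL = "https://fastapi.mymagic.ai/get_result/{task_id}"


def build_submit_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl command."""
    return urllib.request.Request(
        COMPLETIONS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_json(target: Union[str, urllib.request.Request]) -> dict:
    """Send a request (or GET a URL) and parse the JSON body."""
    with urllib.request.urlopen(target) as response:
        return json.load(response)


def poll_until_done(
    fetch_status: Callable[[], dict],
    is_pending: Callable[[dict], bool],
    interval_s: float = 30.0,
    max_attempts: int = 20,
) -> dict:
    """Call fetch_status until is_pending returns False, then return the status."""
    for _ in range(max_attempts):
        status = fetch_status()
        if not is_pending(status):
            return status
        time.sleep(interval_s)
    raise TimeoutError("Task was still pending after the last attempt")


# Hypothetical usage (not executed here; the response fields are assumptions):
# task = fetch_json(build_submit_request("<Your-MyMagic-API-Key>", payload))
# result = poll_until_done(
#     lambda: fetch_json(RESULT_URL.format(task_id=task["task_id"])),
#     is_pending=lambda status: "Pending" in json.dumps(status),
# )
```

Passing the fetch and pending checks in as functions keeps the polling loop independent of the response format, so only the two lambdas need adjusting once you see the actual JSON.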
A departing note
The interest in batch inference has increased in the last few years.
Earlier this year, OpenAI released its Batch API, which costs 50% less than real-time inference for its proprietary models.
Even so, those costs remain significantly higher than what other solutions charge while performing just as effectively.
Currently, MyMagic’s API provides batch inference for various variants of LLaMA and Mistral family models.
As per what I learned from the founders, some incredible engineering has allowed them to offer the cheapest batch inference in the world, profitably.
They intend to lead all batch inference on open-source LLMs, and I’m excited to see how they are progressing!
Sign up today for MyMagic.AI here: MyMagic.AI Get Started.
If you want to discuss your batch inference use case with the MyMagic team, here’s their meeting link: Meet MyMagic team.
🙌 Also, a big thanks to the MyMagic.AI team, who very kindly partnered with me on today’s newsletter and let me share my thoughts openly.
👉 Over to you: What are some other advantages of batch inference?
Thanks for reading!