AWS re:Invent 2020: Deploying PyTorch models for inference using TorchServe

Captions
Hey everybody, welcome to this session. I'm Shashank Prasanna, one of the co-presenters, and in this session we'll be talking about deploying PyTorch models to production using an open source library called TorchServe. I'm joined by my co-presenter Geeta from Facebook. Here's how we've structured the two parts of this presentation: I'll start by talking about some of the challenges with deploying PyTorch models, and then we'll take a look at TorchServe, an open source library jointly developed by AWS and Facebook that helps you deploy PyTorch models to production. We'll look at some of TorchServe's key features and benefits, see what makes it tick under the hood, and then spend most of the time in a demo showing all the APIs: how to install it, how to get started, how to invoke it and get inference results, and so on. We'll also take a look at deploying PyTorch models with Amazon SageMaker. Once I finish my part, Geeta will talk about best practices for production deployment. So, let's get started.

Let's say you're a developer or data scientist. One of the objectives of training models is to deploy them into production and host them, so that clients using mobile apps or web apps can invoke your application and get results. Your model is surrounded by pre-processing and post-processing steps, which constitute your business logic, and you want to expose this model as an endpoint in the cloud. As you consider deploying your PyTorch models to production, a few factors become very important: you want good performance, you want it to be easy to deploy, you want high cost efficiency, and you want to be able to scale out your deployment.

Let's take a closer look at each of these challenges. Performance is very important: you need good throughput to serve inference requests to a large number of customers, but you also want low-latency results, especially if you're deploying latency-sensitive applications such as conversational assistants. You also want it to be easy to deploy: as a developer or data scientist, you want to take your model and deploy it to production without having to write a lot of pre-processing and post-processing code. You also want high cost efficiency, which quickly translates to making sure your deployment instance is highly utilized; if you have a number of CPU threads, or a GPU, you want them fully utilized to keep your deployment costs low. You also want your model to scale as more requests come in and a larger number of customers access your model in production. Finally, you want capabilities such as A/B testing, because you're going to have different versions of models, and you want to be able to monitor model performance while it's running in production.

To address these challenges, AWS, in collaboration with Facebook, developed an open source library called TorchServe to help you deploy PyTorch models to production. Here are some of its key benefits.
First and foremost, it's a high-performance model server. Previously you had to build your own model server or put together your own serving solution around PyTorch, but with TorchServe you get a high-performance model server that provides low-latency APIs right out of the box. It's also optimized for CPUs and GPUs; if you have both kinds of resources, it can take advantage of both. It's super easy to use: in many cases you get zero-code deployment, which means if you're deploying image classification, object detection, or semantic segmentation models, you can just take your model, use one of the default handlers that come with TorchServe, and deploy to production without any code changes. We'll take a look at this in the demo. It's also fully customizable and flexible, staying true to its spirit as an open source project and as a PyTorch project. You get multi-model hosting, you get model versioning, you get server-side batching so your clients don't have to batch requests before sending them for inference, and you get monitoring and logging capabilities. We'll take a closer look at each of these capabilities.

Before we jump into how TorchServe works, there are a few different ways to deploy your models with TorchServe, and we'll look at three options based on how much flexibility and control you want versus how much of a managed solution you want, so you don't have to manage infrastructure. The easiest way is to install TorchServe on an Amazon EC2 instance and run your own model server; because it's open source, you can build it from source or install it with pip or conda. Many customers prefer an orchestration service such as Amazon EKS: you still manage the underlying infrastructure, such as the Amazon EC2 instances, but Amazon EKS makes it easy to orchestrate, scale out, and scale back in when there's no demand. And some customers prefer a fully managed solution, which is where Amazon SageMaker comes into the picture. With Amazon SageMaker, with just a couple of lines shown here, you can create a model and deploy it to get a highly secure, reliable, scalable endpoint just by specifying how many instances you want and whether they're CPU or GPU instances. So depending on where you are on the spectrum of flexibility and control versus a fully managed solution, you can use TorchServe to deploy your PyTorch models to production using any of these three options.

Now, let's take a look at how this works. Let's say you're a developer or data scientist training a PyTorch model. The first thing you do to stand up a TorchServe model server is run the command shown here, torchserve --start. You can also run this in a Docker container, as we'll see in the demo. As soon as you start the TorchServe model server, it stands up a server and offers a few different APIs; the purple boxes here represent the Management API, the Inference API, and the Metrics API, and we'll take a look at what these are. You also get logging capabilities out of the box, and the server needs access to a model store from which it picks up the models to host. So the first thing you have to do as a developer is create what is known as a model archive file: it includes your model definition and your state information, and packages all of this into a single archive file that ends with a .mar extension.
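As a point of reference, here is a minimal sketch of starting a local TorchServe instance and checking its health from Python, assuming TorchServe is already installed (pip install torchserve torch-model-archiver) and a model_store directory with .mar files exists; the directory name and wait time are illustrative, not part of the talk.

```python
# Minimal sketch: start TorchServe locally and ping its health endpoint.
import subprocess
import time

import requests

# Start the model server; by default it exposes the Inference API on :8080,
# the Management API on :8081, and the Metrics API on :8082.
subprocess.run([
    "torchserve", "--start",
    "--model-store", "model_store",  # directory containing .mar archives (assumed to exist)
])

time.sleep(5)  # give the server frontend a moment to come up

# Health check against the Inference API.
resp = requests.get("http://localhost:8080/ping")
print(resp.json())  # expected: {"status": "Healthy"}
```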
Once you have this model archive, and it can be created from a TorchScript or eager-mode model, you can invoke the Management API, which by default is hosted on port 8081 but is fully customizable. It lets you register new models, scale the number of workers associated with a model, set different model versions, and so on. When you register a model, what happens behind the scenes is that TorchServe stands up the endpoint and its workers, depending on how many resources you requested: how many CPU threads you need to host a specific model, or whether it needs a GPU or not. You can also specify server-side batching: how many requests to batch before running inference. Once the model is hosted, your clients, which could be mobile apps, web apps, or other microservices, make requests to the Inference API, by default hosted on port 8080 but again customizable, to get back inference results. And while this is running, you can always query the Metrics API for metrics as inference requests are being made to the model server. At a high level, this is how TorchServe works under the hood, but as a user you don't have to worry about the inner workings; you just have to know what APIs are available to you.

Now, as I mentioned, it's really easy to use, and one of the reasons is that if you have models that do image classification, semantic segmentation, object detection, or text classification, TorchServe offers default model handlers that give you default pre-processing and post-processing steps, so you can just deploy a model, as you'll see in the demo. Staying true to the spirit of an open source project, it can also be extended and customized: here's an example of a handler, which you can fully customize if you have special pre-processing or post-processing requirements (a sketch follows below). You also have the ability to provide a JSON file that maps class indices to names, so that your inference responses are friendly and readable: instead of raw probability scores, you can get labels back. That's an option baked into TorchServe.

We'll spend more time on the different APIs in the demo, but at a high level the Management API is very simple and easy to use. Here are HTTP requests to query model status, register and unregister models, and scale workers. These APIs are also available as gRPC APIs, and there's an example of the Inference API right at the bottom. It's fairly easy to use, and the best way to experience it is to see it in action in a demo. With that, let's take a look at a quick demo.

Here's what I'll cover in the demo. First, we'll launch TorchServe using a Docker container on an Amazon EC2 instance. We'll go through all the different APIs and see how you can make requests, register models, unregister models, scale workers, and so on. We'll then take a look at an easier way to do this using Amazon SageMaker, where you just bring a model and host it with a single line of code. With that, let's jump right into the demo.

To get started, the first thing we'll do is head over to github.com/pytorch/serve. TorchServe, as I mentioned, is a fully open source project, and there are a couple of ways to install it, as described here, depending on your platform and your preference.
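Before moving on with the demo, here is a minimal sketch of what a custom handler like the one described above might look like, assuming an image-classification model and that torchvision and Pillow are installed; the class name, file name, and transform values are illustrative.

```python
# custom_handler.py - a sketch of a custom TorchServe handler.
# Pass it to torch-model-archiver via --handler custom_handler.py.
import io

import torch
from PIL import Image
from torchvision import transforms
from ts.torch_handler.base_handler import BaseHandler


class MyImageClassifier(BaseHandler):
    """Custom pre/post-processing around the default handler lifecycle."""

    image_processing = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def preprocess(self, data):
        # Each request in the batch carries raw image bytes under "data" or "body".
        images = []
        for row in data:
            image_bytes = row.get("data") or row.get("body")
            image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
            images.append(self.image_processing(image))
        return torch.stack(images)

    def postprocess(self, inference_output):
        # Return the top-5 class indices per request as plain Python lists.
        probs = torch.nn.functional.softmax(inference_output, dim=1)
        top5 = torch.topk(probs, 5, dim=1).indices
        return top5.tolist()
```

Model loading and the forward pass are inherited from BaseHandler, so only the request parsing and response shaping change here; pairing this with an index_to_name.json file would turn the indices into readable labels, as described above.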
The approach we'll take in this example is the quick start with a Docker container; this is the easiest way to get started, because there's a Docker container readily available for you to download and start running the TorchServe model server. While we're here, I'd also like to point out that there are documentation pages with more information about how to use TorchServe and all the different APIs available to you, which we'll also look at in this demo.

So with that, let's head over to our demo. What I've done is launch an Amazon EC2 instance with the AWS Deep Learning AMI; really, this could be any EC2 instance with either a CPU or a GPU of your choosing. The very first step is to download a PyTorch model. As a developer or data scientist, you likely are already training PyTorch models and have models you want to deploy; for the purpose of this example, I'll download a readily available model from the official PyTorch repository. This is a DenseNet-161 model, and as soon as it downloads, it's available right here in this directory.

The first step, if you recall from our slides, is to create a model archive file, and to do this we use a utility called torch-model-archiver, which can be installed along with TorchServe. It takes a couple of important arguments that I'll go through in detail. The first is the model name, and, importantly, you can also specify a model version; this becomes important as you train improved models and want to keep working on them rather than categorizing each one as a completely different model. Then there are two key pieces of information you need to specify: the model file itself, which is the definition of your model, and the serialized file, which contains the state dict, the .pth file. Optionally, you can also specify the model store where you want the model archive file to be written; this becomes important because your model server needs access to this location so it can pick up the model and host an endpoint for it. You can also optionally provide extra files; in this case we have an index-to-name JSON file so that inference responses are human readable rather than just raw probabilities. And finally, you can specify a handler. This is a really convenient feature, because if you're deploying image classification, object detection, or text classification models, all the pre-processing and post-processing is automatically handled for you when you use one of the default handlers.

So let's go ahead and run this. As soon as you do, you'll see that in a moment the torch-model-archiver utility takes your .pth file and saves a .mar file under the model store directory. Now this model is ready to be hosted with TorchServe. To make things interesting, I'll repeat the same steps with another model, in this case a Faster R-CNN model: again, I download the model and create the model archive file. This is a slightly bigger model, and it's an object detection model, so it does something slightly different: it finds a bounding box around the object of interest in addition to identifying what the object is. And if we look at the model store directory, I now have two different model archive files.
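Here is a sketch of creating the model archive described above by invoking the torch-model-archiver CLI from Python; the file names mirror the demo but are assumptions about your local layout (in particular model.py, the .pth file name, and the model_store directory).

```python
# Sketch: build densenet161.mar into ./model_store using torch-model-archiver.
import subprocess

subprocess.run([
    "torch-model-archiver",
    "--model-name", "densenet161",                     # name used to register/invoke the model
    "--version", "1.0",                                # model version for versioned serving
    "--model-file", "model.py",                        # eager-mode model definition (assumed path)
    "--serialized-file", "densenet161-8d451a50.pth",   # state dict weights (assumed file name)
    "--export-path", "model_store",                    # directory TorchServe reads archives from
    "--extra-files", "index_to_name.json",             # maps class indices to readable labels
    "--handler", "image_classifier",                   # built-in default handler, no custom code
], check=True)
```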
With that, we're now ready to launch a server and host these models. As I mentioned, one of the easiest ways to launch a server is to use the TorchServe Docker container, and if you're new to Docker, it's fairly straightforward: you just need to launch an AMI with Docker installed, or use the AWS Deep Learning AMI, which includes Docker as part of the AMI. To run the TorchServe container, you specify which ports need to be accessible, in this case 8080 and 8081, because these correspond to the Inference API and the Management API. You also specify the model store where the model archives live, and the container image you want; in this case I'm using pytorch/torchserve:latest, but if you want a GPU-compatible container, you would use latest-gpu. Then we launch the server. So let's go ahead and run this, and as soon as I do, I can see that my TorchServe server is up and running.

What we'll do now is submit requests. While the server is up, let's go ahead and register our very first model. To do that, we use the Management API and specify the model we want to register, or host as an endpoint, and we provide a couple of options: initial workers, which is the number of workers (CPU threads, essentially) you want; the batch size, so you can do server-side batching; and the name of the model archive. You can do the same with Faster R-CNN. Let me go ahead and register both models by running these commands, and as soon as I do, I see that both models are registered. You can also use the Management API to query which models are currently registered by running a cURL request against port 8081 at /models; you can see the two models we just registered. You can also get more detailed information about a specific registered model: by specifying the model name, you get details such as how many workers it has, what the batch size is, and whether it's using a GPU or not; in this case no, because I'm on a CPU instance.

Great. The only thing left to do now is submit inference requests, and to do that I'm going to download an image of a kitten; as you can see, I have an image called kitten_small.jpg. To submit requests, I use the Inference API, which is hosted on port 8080, as opposed to the Management API on 8081; these are, of course, defaults that you can change. As soon as I submit the request, I get my response with the top five categories, and I also get the friendly names here because I specified the index-to-name JSON file while creating the model archive file. I can also unregister models so they're no longer hosted. Let me first show which models are registered: these two. To unregister, I submit a DELETE request, and the model is unregistered; I can verify that I only have Faster R-CNN left and DenseNet has been removed, because I just unregistered it (the whole flow is sketched below).

Awesome. So that's a quick example of using TorchServe on an Amazon EC2 instance. If you want a fully managed experience, let me show you a way to do this using Amazon SageMaker. If you're new to SageMaker, it's a fully managed service for every step of the machine learning workflow, all the way from data labeling to hosted notebooks, large-scale training, and inference, that is, deploying and hosting endpoints, which is what we'll take a look at now.
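Here is a sketch of the register / describe / predict / unregister flow just demonstrated with cURL, expressed with Python's requests library; the ports are the TorchServe defaults, and the .mar name and image file name are assumptions that mirror the demo.

```python
# Sketch: drive the TorchServe Management and Inference APIs from Python.
import requests

MGMT = "http://localhost:8081"   # Management API (default port)
INFER = "http://localhost:8080"  # Inference API (default port)

# Register the model with one initial worker and server-side batching of 4.
requests.post(f"{MGMT}/models", params={
    "url": "densenet161.mar",
    "initial_workers": 1,
    "batch_size": 4,
})

# List registered models, then describe one of them (workers, batch size, GPU use).
print(requests.get(f"{MGMT}/models").json())
print(requests.get(f"{MGMT}/models/densenet161").json())

# Submit an inference request with raw image bytes.
with open("kitten_small.jpg", "rb") as f:
    result = requests.post(f"{INFER}/predictions/densenet161", data=f)
print(result.json())  # top-5 labels and probabilities

# Unregister the model when done.
requests.delete(f"{MGMT}/models/densenet161")
```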
SageMaker also covers data processing and many other things, which I'll let you explore on your own. So let's head over to Amazon SageMaker Studio, where I have a notebook that shows how to deploy a TorchServe model and host it with Amazon SageMaker. The first order of business is to import the SageMaker SDK, so we can make requests to SageMaker to host an endpoint. After that, we first download a DenseNet-161 model, just like we saw in the Amazon EC2 example. Once you download the model, you don't actually have to create the model archive file yourself; SageMaker does it automatically for you. All you need to do is create a tar file containing your model file and the .pth file, as described in this cell, and upload that tar file to Amazon S3. After that, there are just two steps to host the model. The first is to create a PyTorch model object: you specify the location of your model in Amazon S3, where you just uploaded it, along with a few other things, such as the entry point, the framework version, and the Python version. Then it's just one line of code to deploy your PyTorch model, as sketched below: you call model.deploy, specify the number of instances you want, one or many, and all the load balancing is automatically handled for you, and the type of instance, whether you want a CPU instance or a GPU instance. After that, you can submit an inference request to this endpoint and get back these results. So that's the quick and easy way to use TorchServe with Amazon SageMaker, and you get all the other benefits of SageMaker, including model monitoring to catch drift, and so on. So that was a quick demo; let's head back to our slides.

Thanks, Shashank. Let's dive into the best practices for production deployment. To have success with your models in production, you need to start with responsible AI, with fairness and human-centered design in mind. After the model is trained, you go through an optimization phase, which involves looking at things like performance versus latency trade-offs, TorchScripting the model for higher throughput, whether the model will be deployed offline versus in real time, and cost considerations. For the deployment architecture itself, based on whether you're deploying in the cloud or on premises, you have to decide whether you'll use orchestration solutions and whether you'll deploy in a primary/backup scenario or in standalone mode. To get solutions that are highly resilient, you need a robust endpoint, and you need to consider things like auto scaling, canary deployments, and A/B deployments. Then there needs to be continuous measurement of the right metrics for model performance and interpretability, and a feedback loop for continuous refinement.

For responsible AI, fairness by design and human-centered design play a key role. On the fairness side, you have to consider model bias and data bias, look at ways to measure the skewness of the data, identify relevant metrics such as false positive rates across demographic groups, and provide transparency to users about exactly how their data will be used by the AI models. For explainability, provide visibility into the decision-making process, why a certain recommendation was made by the model, and have an inclusive design that takes into consideration all age groups and demographics.
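Returning to the SageMaker portion of the demo, here is a minimal sketch of the deployment steps described above, assuming the SageMaker Python SDK v2 running inside SageMaker Studio; the S3 location, entry-point script name, framework and Python versions, and instance type are all illustrative assumptions, not values from the talk.

```python
# Sketch: host a PyTorch model on a SageMaker endpoint (TorchServe runs inside
# the SageMaker PyTorch inference container).
import sagemaker
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()  # works inside SageMaker Studio/notebooks

model = PyTorchModel(
    model_data="s3://my-bucket/densenet161/model.tar.gz",  # tar.gz uploaded earlier (assumed path)
    role=role,
    entry_point="inference.py",     # inference script with model_fn/predict_fn (assumed name)
    framework_version="1.6.0",      # illustrative PyTorch version
    py_version="py3",
)

# One call to deploy: instance count and type; load balancing is handled for you.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",   # or a GPU type such as ml.g4dn.xlarge
)

# data = some_preprocessed_array_or_bytes
# print(predictor.predict(data))   # invokes the hosted endpoint
# predictor.delete_endpoint()      # clean up when finished
```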
From a human-centered perspective, when you are designing your models, think about the impact of the AI decision making on the people who will be using that particular application: do they have human recourse available, or should the solution not be fully automated in all cases? As an example, take a mortgage application that uses AI systems: if all of a sudden you start seeing high rejection rates for a certain category or race, people should have recourse to connect with a human and get their application reviewed. If you're looking at computer vision models, you should ask whether the model was trained across a diverse population, taking care of people with different skin tones and different age groups, so that no bias is introduced as a result.

On the optimization side, you build the model with performance and latency goals in mind. You can reduce the size of the model using techniques like quantization, pruning, and mixed-precision training, and to increase throughput you can TorchScript your models, as sketched below; TorchServe provides a very nice SnakeViz profiler to do this analysis. For large NLP models, transformer-based models, you may want to deploy on a GPU for low latency. And now that the TorchServe 0.3 release has support for gRPC, you may want to analyze whether gRPC gives you better performance than REST; this will be especially relevant for audio and video models.

Here are some examples of the benefits we have seen from quantization inside Facebook. As you can see, across these different models, ResNet, MobileNet, and the Translate/FairSeq pair, there is zero to very little loss of accuracy when the models are converted to int8, and we have seen 2x to 4x speedups in inference. This is an example of the SnakeViz profiler that is bundled with TorchServe: what you're seeing here is a model profiled in eager mode, and this is the same model profiled with TorchScript, and we saw speedups of up to 4x because the serialization overhead isn't there in the case of TorchScript models.

Other considerations: if you're doing offline predictions, you can do dynamic batching of your predictions. If you're doing online predictions, you should consider asynchronous processing in either a push or a pull mode; if you're storing the results in a database, you can poll for them. If there are certain elements that do not change through the day, you can precompute predictions at night and use them for the entire day, so you can introduce techniques like this. On the cost optimization side, for offline models you can look at Spot Instances: if an instance goes away, you can have a retry loop in your predictions and retry until the next instance comes up, and you can use techniques like auto scaling based on metrics for an on-demand cluster.

Shashank talked about flexibility versus managed deployment, and this is the full spectrum of deployment options supported by TorchServe. On premises, for example, when you're doing develop and test, you can start off by installing from source or using Docker containers; you can then deploy it with MLflow or Kubeflow alongside your ML microservices, and we now have support for KFServing and Kubernetes, with auto scaling and canary rollouts.
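Here is a minimal sketch of two of the optimizations mentioned above, dynamic int8 quantization and TorchScript, applied to a small example model; the layer choices, input shape, and file name are illustrative, not a recipe from the talk.

```python
# Sketch: dynamic quantization plus TorchScript for a toy classifier.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
).eval()

# Dynamic quantization stores the Linear weights as int8 and quantizes
# activations on the fly; accuracy loss is typically small.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Trace the quantized model to TorchScript so it can run without the Python
# interpreter and be packaged (for example, into a .mar for TorchServe).
example_input = torch.randn(1, 256)
scripted = torch.jit.trace(quantized, example_input)
scripted.save("model_scripted.pt")

# Sanity check: eager quantized output and scripted output should match closely.
with torch.no_grad():
    print(torch.allclose(quantized(example_input), scripted(example_input)))
```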
On the cloud side, you can start with the AWS CloudFormation template that we provide out of the box to get going quickly, you can deploy TorchServe as a microservice behind an API Gateway, or you can use SageMaker endpoints, either with the default built-in mechanism inside SageMaker or by bringing your own container. For a fully managed solution, you can consider serverless functions or SageMaker, and if you're using MLflow, you can use Databricks-managed MLflow. And of course, with SageMaker you have the option to do full canary rollouts, and we have support for Amazon EKS as well.

On the resiliency side, you need to make sure the endpoints you create for your model serving are robust. SageMaker provides a mechanism for doing that, and you can build your own custom endpoints behind an API Gateway as well. For auto scaling, when you're deploying in an orchestration scenario, you can scale based on metrics, whether that's SageMaker auto scaling, Kubernetes, or KFServing. TorchServe supports multi-node deployment, so if you're running on EC2, you can do multi-node deployments. We highly encourage you to use canary rollouts when deploying your models to production: test the new version on a small subset of users and then roll it out to the entire population. In certain cases, when you're building a new model, you may want to do shadow inference while the model is being trained, so that the new version of the model is highly performant when you deploy it to production. And when you have multiple models and need to choose which one gives you the best results, you can use techniques like A/B testing, and TorchServe supports deploying all of these models.

On the measurement side, you should define performance metrics such as accuracy while designing the AI service itself; this will be very use-case specific. TorchServe has support for custom metrics, and you can log them to CloudWatch or Prometheus and monitor your model's performance. With the 0.3 release, we added support for model interpretability with Captum (sketched below), so please experiment with that, do your explainability analysis, and close the feedback loop: if model accuracy drops over time, do analyses such as concept drift analysis to understand whether the data is becoming stale or the model has become old, and continue to refine your model.

So, this is what the loop looks like: start by understanding the requirements of the product or application where the AI model will be rolled out, get alignment across all the stakeholders, define the metrics that will be used for monitoring, define the measurement and mitigation criteria, and then do continuous monitoring and refine the model as you operate your service. In future versions, we are rolling out support for ensemble models, we are adding performance improvements around memory and better resource utilization for better scalability, a C++ inference backend is coming for lower latency, we are adding support for the AWS Inferentia AI accelerator chip, and enhanced profiling tools will be provided. I'll hand it back to Shashank now.

Here are some resources for you to get started with TorchServe. TorchServe is an open source project, so head over to the PyTorch GitHub repository and take a look at TorchServe for more information on getting started, documentation, and so on.
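As a pointer for the interpretability discussion above, here is a minimal sketch of the kind of analysis Captum enables, using Integrated Gradients on a toy classifier; the model, input shapes, and target class are illustrative and not tied to the TorchServe Captum integration itself.

```python
# Sketch: per-feature attributions with Captum's Integrated Gradients.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 3),
).eval()

inputs = torch.randn(4, 20, requires_grad=True)

# Attribute each input feature's contribution to the prediction for class 0.
ig = IntegratedGradients(model)
attributions, delta = ig.attribute(
    inputs,
    target=0,                       # class index to explain
    return_convergence_delta=True,  # sanity check on the approximation
)
print(attributions.shape)  # (4, 20): per-feature attribution scores
print(delta)               # convergence deltas, close to zero is good
```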
We also published a launch blog post with more information about what TorchServe is and how to use it, with an example that shows how to deploy to SageMaker. There are also recorded videos with more information on how to run TorchServe. If you want to know more about SageMaker, here are some documentation pages and samples linked on GitHub that show you how to train and deploy PyTorch models. And finally, if you're interested in the topic of accelerating model deployment, in this talk we discuss how to accelerate models for inference using CPUs, GPUs, Amazon Elastic Inference, and AWS Inferentia; feel free to watch that talk and also check out the related blog posts for more information on this topic. With that, I want to thank you for taking the time to listen to me and Geeta. If you have questions, feel free to reach out to me on Twitter, LinkedIn, or Medium; here are some links. Thanks again for listening, and please take a few minutes to fill out the survey. Thank you.
Info
Channel: AWS Events
Views: 8,763
Keywords: re:Invent 2020, Amazon, AWS re:Invent, OPN306, Open Source, Amazon Elastic Compute Cloud (Amazon EC2), Amazon SageMaker, Facebook
Id: 6xaMskcWmXY
Length: 32min 49sec (1969 seconds)
Published: Fri Feb 05 2021