AWS Summit ANZ 2022 - End-to-end MLOps for architects (ARCH3)

Video Statistics and Information

Captions
Hi, welcome to this session on end-to-end MLOps for architects. My name is Sara van de Moosdijk, but you can call me Moose. I'm a Senior AI/ML Partner Solutions Architect at AWS. My goal for this session is to help architects and developers, especially those of you not specialised in machine learning, to design an MLOps architecture for your organisation. I will introduce the different components in an effective MLOps setup and explain why these components are necessary, without diving too deep into the details which only data scientists need to know.

Before watching this session, you should be comfortable with architecting on AWS and know how to use popular services like S3, Lambda, EventBridge, CloudFormation, and so on. If you're unfamiliar with these services, I recommend watching some of the other architecture sessions and reading up on these topics, then coming back to watch this session. It will also help if you have a basic understanding of machine learning and how it works.

Okay, let's look at what you can expect. I'll start the session with a brief overview of the challenges many companies face when deploying machine learning models and maintaining them in production. We will then briefly define MLOps before diving straight into some architecture diagrams. Specifically, I have designed the architecture diagrams according to t-shirt sizes, from small to large. This will allow you to choose your starting point based on the size and maturity of your organisation. And finally, I'll end the session with some advice for starting your own MLOps journey.

First, I want to quickly go over the machine learning process to make sure we're all on the same page. Normally, you'd start your machine learning process because you have a business problem which needs to be solved, and you've determined that machine learning is the correct solution. Then a data scientist will spend quite a bit of time collecting the data required, integrating data from various sources, cleaning the data, and analysing the data. Next, the data scientist will start the process of engineering features, training and tuning different machine learning models, and evaluating the performance of these models. Based on these results, the data scientist might go back to collect more data or perform additional data cleaning steps. But assuming the models are performing well, he or she would then deploy the model so it can be used to generate predictions.

The final step, and certainly a crucial one, is to monitor the model in production. Much like a new car which depreciates in value as soon as you drive it off the lot, a machine learning model is out of date as soon as you've trained it. The world is constantly changing and evolving, so the older your model gets, the worse it gets at making predictions. By monitoring the quality of your model, you will know when it's time to retrain or perhaps gather new data.

Every data scientist will follow a process along these lines, and when they're just starting out, this process will likely be entirely manual. But as businesses embrace machine learning across their organisations, manual workflows for building, training, and deploying models tend to become bottlenecks to innovation. Even a single machine learning model in production needs to be monitored, managed, and retrained to maintain the quality of its predictions.
Without the right operational practices in place, these challenges can negatively impact data scientist productivity, model performance, and costs. To illustrate what I mean, let's take a look at some architecture diagrams.

Here I have a pretty standard data science setup. The data scientist has been given access to an AWS account with a SageMaker Studio domain, where they can use Jupyter Notebooks to develop their machine learning models. Data might be pulled from S3, RDS, Redshift, Glue, or any number of data-related AWS services. The models produced by the data scientist are then stored in an S3 bucket. So far, so good.

Unfortunately, this is often where it stops. Gartner estimates that only 53% of machine learning POCs actually make it into production, and there are various reasons for this. Often there's a misalignment between the strategic objectives of the company and the machine learning models being built by the data scientists. A lack of communication between DevOps, security, legal, IT, and the data scientists is another common challenge blocking models from reaching production. And finally, if the company already has a few models in production, a data science team can struggle to maintain those existing models while also pushing out new models.

But what if the model does make it into production? Let's assume the data scientist spins up a SageMaker endpoint to host the model, and the developer of an application connects to this model to generate predictions through API Gateway, which invokes a Lambda function, which in turn calls the SageMaker endpoint.
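As a minimal sketch of that inference path — assuming a JSON payload and a hypothetical ENDPOINT_NAME environment variable, neither of which is specified in the talk — the Lambda function behind API Gateway might look something like this:

```python
import os
import boto3

# SageMaker runtime client for invoking a deployed endpoint
runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    # With a Lambda proxy integration, API Gateway passes the HTTP body as a string
    payload = event["body"]

    response = runtime.invoke_endpoint(
        EndpointName=os.environ["ENDPOINT_NAME"],  # hypothetical: endpoint name injected via configuration
        ContentType="application/json",
        Body=payload,
    )

    # Return the model's prediction to API Gateway
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": response["Body"].read().decode("utf-8"),
    }
```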
So what challenges do you see with this architecture? First, any change to the machine learning model requires manual action by the data scientist, in the form of re-running cells in a Jupyter Notebook. Second, the code which the data scientist produces is stuck in these Jupyter Notebooks, which are difficult to version and difficult to automate. Third, the data scientist might have forgotten to turn on auto-scaling for the SageMaker endpoint, so it cannot adjust capacity according to the number of requests coming in. And finally, there's no feedback loop: if the quality of the model deteriorates, you would only find out through complaints from disgruntled users.

These are just some of the challenges which can be avoided with a proper MLOps setup. So what is MLOps? MLOps is a set of operational practices to automate and standardise model building, training, deployment, monitoring, management, and governance. It can help companies streamline the end-to-end machine learning lifecycle and boost the productivity of data scientists and MLOps teams, while maintaining high model accuracy and enhancing security and compliance.

The key phrase in that definition is operational practices. MLOps, similar to DevOps, is more than just a set of technologies or services. You need the right people, with the right skills, following the same standardised processes to successfully operate machine learning at scale. The technology exists to facilitate these processes and make the job easier for the people. In this session, I will focus on the technology, and specifically on which AWS services we can use to build a successful setup. But keep in mind that the architectures provided in this session will only work if you have the right teams, and if those teams are willing to establish and follow MLOps processes. If you want to learn more about the people and process aspects of MLOps, I will include links to useful resources at the end of this session.

So without further ado, let's dive deep into some architecture diagrams. MLOps can be quite complicated, with lots of features and technologies which you could choose to adopt, but you don't have to adopt all of it immediately. To start off, I'll give you an example of a minimal MLOps setup. This would be suitable for a small company, or a small data science team of one to three people working on just a couple of use cases.

We'll start with the same architecture we looked at previously, only reduced in size so I can create more space on the slide. A data scientist accesses Jupyter Notebooks through SageMaker Studio, accesses data from various data sources, and stores any machine learning models they create in S3. One of the challenges I mentioned previously is that the code is stuck in Jupyter Notebooks and can be difficult to version and automate, so the first step is to add more versioning to this architecture. You can use CodeCommit or any other Git-based repository to store code, and you can use Amazon Elastic Container Registry (ECR) to store Docker containers, thereby versioning the environments which were used to train the machine learning models. By versioning the code, the environments, and the model artifacts, you improve your ability to reproduce models and collaborate with others.

Next, let's talk about automation. Another challenge I mentioned previously is that data scientists end up manually re-training models instead of focussing on developing new models. To solve this, you want to set up automatic re-training pipelines. In this architecture I use SageMaker Pipelines, but you could also use Step Functions or Airflow to build these repeatable workflows. The re-training pipeline, built by the data scientist or by a machine learning engineer, uses the versioned code and environments to perform data pre-processing, model training, and model verification, and eventually saves the new model artifacts to S3. It can use various services to complete these steps, including SageMaker processing or training jobs, EMR, or Lambda. But in order to automate this pipeline, we need a trigger. One option is to use EventBridge to trigger the pipeline on a schedule. Another option is to have someone trigger the pipeline manually. Both triggers are useful in different contexts, and I'll introduce more triggers as we progress through these slides.

Now that we have an automated re-training pipeline, I want to introduce another important concept in MLOps: the model registry. While S3 provides some versioning and object-locking functionality, which is useful for storing different models, a model registry helps to manage these models and their versions. SageMaker Model Registry allows you to store metadata alongside your models, including the values of hyperparameters and evaluation metrics, or even bias and explainability reports. This enables you to quickly view and compare different versions of a model, and to approve or reject a model version for production. The actual artifacts are still stored in S3; the model registry sits on top of that as an additional layer.
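To make the registry concrete, here is a rough sketch of how a re-training pipeline step might register a newly trained model as a version awaiting approval; the model package group name, ECR image URI, and S3 path are all hypothetical placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

# Register the newly trained model as a version in an existing model package group
sm.create_model_package(
    ModelPackageGroupName="demand-forecast",      # hypothetical group created beforehand
    ModelPackageDescription="Weekly re-training run",
    ModelApprovalStatus="PendingManualApproval",  # a human approves or rejects it later
    InferenceSpecification={
        "Containers": [
            {
                "Image": "123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/train-env:1.0",  # hypothetical ECR image
                "ModelDataUrl": "s3://example-bucket/models/model.tar.gz",  # hypothetical artifact location
            }
        ],
        "SupportedContentTypes": ["application/json"],
        "SupportedResponseMIMETypes": ["application/json"],
    },
)
```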
Finally, we reach the deployment stage. At first glance, this might look very different from what we saw earlier in the session, but the setup is actually very similar. I still have machine learning models deployed on real-time SageMaker endpoints, connected to Lambda and API Gateway to communicate with an application. The main difference now is that I have auto-scaling set up for my SageMaker endpoints, so if there's an unexpected spike in users, the endpoints can scale up to handle the requests and scale back down when usage falls.

One nice feature of SageMaker endpoints is that you can replace the machine learning model without endpoint downtime. Since I now have an automated re-training pipeline creating new models, and a model registry where I can approve models, it would be best if the deployment of new models were automated as well. I can achieve this by building a Lambda function which triggers when a new model is approved, fetches that model, and updates the endpoint with it.

So now we have connected all the pieces, and there's one final feature I will take advantage of. Not only can I update the machine learning models hosted by the endpoints, I can do so gradually using a canary deployment. This means that a small portion of user requests is diverted to the new model, and any errors or issues will trigger a CloudWatch alarm to inform me. Over time, the number of requests sent to the new model increases until the new model receives 100% of the traffic.
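A sketch of what such a canary rollout could look like when updating the endpoint with boto3 — the endpoint, config, and alarm names are hypothetical, and the percentage and wait times are arbitrary examples:

```python
import boto3

sm = boto3.client("sagemaker")

# Point the endpoint at a new endpoint config (i.e. the newly approved model),
# shifting traffic gradually and rolling back automatically if an alarm fires
sm.update_endpoint(
    EndpointName="forecast-endpoint",         # hypothetical existing endpoint
    EndpointConfigName="forecast-config-v2",  # hypothetical config referencing the new model
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},  # start with 10% of traffic
                "WaitIntervalInSeconds": 600,  # observe the canary before shifting the rest
            },
            "TerminationWaitInSeconds": 300,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "forecast-endpoint-errors"}]  # hypothetical CloudWatch alarm
        },
    },
)
```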
So I hope this architecture makes sense. I started with a very basic setup, and by adding a few features and services, I now have a serviceable MLOps setup. My deployment strategy is more robust through auto-scaling and canary deployments, my data scientists save time by automating model training, and every artefact is properly versioned. But as your data science team grows, this architecture won't be sufficient. So let's look at a slightly more complicated architecture.

The next architecture is more suitable for a growing team of three to ten data scientists working on several different use cases at a larger company. Again, let's start with the basics. Our data scientists work in notebooks through SageMaker Studio, pulling from various data sources and versioning their code, environments, and model artifacts. This should look familiar. Let's also bring back the automated re-training pipeline; nothing has changed here, I've only made it smaller to create more room on the slide. And finally, I'll bring back EventBridge to schedule the re-training pipeline, and the model registry for storing model metadata and approving model versions. All of this is exactly the same as in the previous architecture diagram.

So what about deployment? This is where things change a little. I have the same deployment setup, with SageMaker endpoints in an auto-scaling group connected to Lambda and API Gateway to allow users to submit inference requests. However, these deployment services now sit in a separate AWS account. A multi-account strategy is highly recommended, because it allows you to separate different business units, easily define separate restrictions for important production workloads, and get a fine-grained view of the costs incurred by each component of your architecture. The different accounts are best managed through AWS Organizations.

Data scientists should not have access to the production account. This reduces the chance of mistakes being made on that account which would directly affect your users. In fact, a multi-account strategy for machine learning usually has a separate staging account alongside the production account. Any new models are first deployed to the staging account, tested, and only then deployed to the production account.

If the data scientists cannot access these accounts, clearly the deployment must happen automatically. All of the services deployed into the staging and production accounts are set up automatically using CloudFormation, controlled by CodePipeline in the development account. The next step is to set up a trigger for CodePipeline, and we can do so using EventBridge: when a model version is approved in the model registry, this generates an event which can be used to trigger deployment via CodePipeline.
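As a sketch of that trigger, an EventBridge rule can match the model registry's approval events and start the pipeline; the rule name, pipeline ARN, and IAM role here are hypothetical:

```python
import json
import boto3

events = boto3.client("events")

# Rule that fires whenever a model package in the registry is approved
events.put_rule(
    Name="model-package-approved",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Package State Change"],
        "detail": {"ModelApprovalStatus": ["Approved"]},
    }),
)

# Start the deployment pipeline in response to the approval event
events.put_targets(
    Rule="model-package-approved",
    Targets=[
        {
            "Id": "start-deploy-pipeline",
            "Arn": "arn:aws:codepipeline:ap-southeast-2:111122223333:deploy-model",  # hypothetical pipeline
            "RoleArn": "arn:aws:iam::111122223333:role/eventbridge-invoke-codepipeline",  # hypothetical role
        }
    ],
)
```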
So now everything's connected again, and this is starting to look like a proper MLOps setup. But I'm sure you've noticed I have plenty of space left on this slide. So let's add another feature which becomes crucial when you have multiple models running in production for extended periods of time: model monitoring.

The goal of monitoring machine learning models in production is to detect a change in behaviour or accuracy. To start, I enable data capture on the endpoints in the staging and production accounts. This captures the incoming requests and outgoing inference results and stores them in S3 buckets. If you have a model monitoring use case which doesn't require labelling the incoming requests, you could run the whole process directly on your staging and production accounts. But in this case, I assume the data needs to be combined with labels or other data in the development account, so I use S3 replication to move the data into an S3 bucket in the development account.

Now, in order to tell whether the behaviour of the model or the data has changed, we need something to compare it to. That's where the model baseline comes in. During the training process, as part of the automated re-training pipeline, we can generate a baseline dataset which records the expected behaviour of the data and the model. That gives me all the components I need to set up SageMaker Model Monitor, which will compare the two datasets and generate a report. The final step in this architecture is to take action based on the results of the model monitoring report, and we can do this by sending an event to EventBridge to trigger the re-training pipeline when a significant change has been detected.

And that's it for the medium MLOps architecture! It contains a lot of the same features used in the small architecture, but it expands to a multi-account setup and adds model monitoring for extra quality checks on the models in production. Hopefully, you're now wondering what a large MLOps architecture looks like, and how I can possibly fit more features onto a single slide. So let's take a look at that now.

This architecture is suitable for companies with large data science teams of ten or more people, and with machine learning integrated throughout the business. Of course, I start with the same basic setup I had last time, but reduced in size again. The data scientist is still using SageMaker Studio through a development account, and stores model artifacts and code in S3 and CodeCommit respectively. The data sources are also present, but data is now stored in a separate account. It's a common strategy to set up your data lake in one account, with fine-grained access controls to determine which datasets can be accessed by resources in other accounts. Really, the larger a company becomes, the more AWS accounts it tends to use, all managed through AWS Organizations.

So let's continue this trend by bringing back the automated re-training pipeline in a separate operations account, and let's bring back the model registry as well, in yet another account. All of the components are the same as in the small and medium architecture diagrams, just split across more accounts. The operations account is normally used for any automated workflows which don't require manual intervention by the data scientists. It's also good practice to store all of your artifacts in a separate artefact account, like I have here for the model registry. Again, this is an easy way to prevent data scientists from accidentally changing production artifacts.

Next, let's bring back the production and staging accounts with the deployment setup. This is exactly the same as in the previous architecture, just reduced in size. The infrastructure in the production and staging accounts is still set up automatically through CloudFormation and CodePipeline, but CodePipeline now sits in a CI/CD account. Note that I have built this diagram based on account structures I have seen organisations use, but your organisation might use a different account setup, and that's totally fine. Use this diagram as an example and adjust it to your structure and your needs.

Now, let's connect our model registry to CodePipeline using EventBridge, exactly the same as in the previous architecture. And now we have all the pieces connected again. But I don't know if you noticed: one of the basic building blocks is still missing in this picture. Hopefully you spotted it - ECR disappeared for a little while. So let's bring it back by placing it in the artefact account, because environments, especially production environments, are artifacts which need to be protected.

There's one more change I want to make to my use of ECR here. In the previous architecture diagrams, I assumed that data scientists were building Docker containers and registering these containers in ECR manually. This process can be simplified, and indeed automated, using CodeBuild and CodePipeline. The data scientist or machine learning engineer still writes the Dockerfile, but the building and registration of the container is performed automatically. This saves even more time, so data scientists can focus on what they do best.

Of course, in the previous architecture I used Model Monitor to trigger model re-training if significant changes in model behaviour were detected. So let's bring that back as well, starting with the data capture in the staging and production accounts, followed by data replication into the operations account. As before, Model Monitor will need a baseline to compare performance against, and the generation of this baseline can be a step in SageMaker Pipelines. Finally, I'll bring back Model Monitor to generate reports on drift and trigger re-training if necessary.
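A minimal sketch of generating that baseline with the SageMaker Python SDK — the execution role and S3 paths are hypothetical, and in practice this would run as a pipeline step rather than ad hoc:

```python
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::111122223333:role/sagemaker-execution",  # hypothetical execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Profile the training data to record the statistics and constraints
# that captured production traffic will later be compared against
monitor.suggest_baseline(
    baseline_dataset="s3://example-bucket/train/train.csv",   # hypothetical training data
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://example-bucket/monitoring/baseline",  # where statistics.json and constraints.json land
)
```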
This leaves us with all of the components I had in the medium MLOps diagram, but there are two more features I want to introduce. The first is SageMaker Feature Store, which sits in the artefact account because features are artifacts which can be reused. If you remember the basic data science workflow from the beginning of the session, data scientists will normally perform feature engineering before training a model, and it has a large impact on model performance. In large companies, there's a good chance that data scientists will be working on separate use cases which rely on the same dataset. A feature store allows data scientists to take advantage of features created by others. It reduces their workload and also ensures consistency in the features created from a dataset.

The final component I want to introduce is SageMaker Clarify. Clarify can be used by data scientists in the development phase to identify bias in datasets and to generate explainability reports for models. This technology is important for responsible AI. Similar to Model Monitor, Clarify can also be used to generate baseline bias and explainability reports, which can then be compared to the behaviour of the model on the endpoint. If Clarify finds that bias is increasing or the explainability results are changing, it can trigger a re-training of the model.

Both Feature Store and Clarify can be introduced much earlier, in the medium or even the small MLOps architectures. It really depends on the needs of your business, and I hope you can use these example architectures to design an architecture which works for you.

Now, the architecture diagrams in this session rely heavily on different components offered by Amazon SageMaker. SageMaker provides purpose-built tools and built-in integrations with other AWS services so you can adopt MLOps practices across your organisation. Using Amazon SageMaker, you can build CI/CD pipelines to reduce model management overhead, automate machine learning workflows to accelerate data preparation, model building, and model training, monitor the quality of models by automatically detecting bias, model drift, and concept drift, and automatically track the lineage of code, datasets, and model artifacts for governance.

But if there's one thing I want you to take away from this session, it should be this: MLOps is a journey. You don't have to immediately adopt every feature available in a complicated architecture design. Start with the basic steps to integrate versioning and automation. Evaluate all the features I introduced in this session, order them according to the needs of your business, and then adopt them as and when they're needed. The architecture diagrams I presented in this session are not the only way to implement MLOps, but I hope they provide some inspiration to you as an architect.

To help you get started, I've collected some useful resources and placed the links on this slide. You should be able to download a copy of these slides so you can access the links. The resources on this page provide advice not only on the technology behind MLOps, but also on the people and processes which we discussed briefly at the start.

If you're interested in any other topic related to the AWS Cloud, I recommend checking out the Skill Builder online learning centre. It offers over 500 free digital courses for any level of experience. You should also consider getting certified through AWS Certification. If you enjoyed the topic of this particular session, I'd recommend the Solutions Architect Associate and Professional certifications, as well as the Machine Learning Specialty certification.

And that's all I have for today. Thanks for taking the time to listen to me talk about MLOps, and I hope this content helps you in upcoming projects. I just have one final request, and that's to complete the session survey. It's the only way for me to know if you enjoyed this session, and it only takes a minute. I hope you have a wonderful day and enjoy the rest of Summit!
Info
Channel: AWS Events
Views: 44,701
Keywords: AWS, Events, Webinars, Amazon Web Services, AWS Cloud, Amazon Cloud, AWS re:Invent, AWS Summit, AWS re:Inforce, AWSome Day Online, aws tutorial, aws demo, aws webinar, Machine Learning, Financial Services, Professional Services, Solution or Systems Architect, IT Professional or Technical Manager, Amazon SageMaker, Amazon SageMaker Studio
Id: UnAN35gu3Rw
Length: 23min 1sec (1381 seconds)
Published: Thu Sep 01 2022