Hi, welcome to this session on
end-to-end MLOps for architects. My name is Sara van de
Moosdijk, but you can call me Moose. I'm a Senior AI/ML
Partner Solutions Architect at AWS. Now, my goal for this
session today is to help architects and developers,
especially those of you not specialised in machine learning,
to design an MLOps architecture for your organisation. I will
introduce the different components in an effective
MLOps setup, and explain why these components are necessary
without diving too deep into the details which only data
scientists need to know. Now, before watching this session,
you should be comfortable with architecting on AWS, and know
how to use popular services like S3, Lambda, EventBridge,
CloudFormation, and so on. If you're unfamiliar with these
services, I recommend watching some of the other architecture
sessions and reading up on these topics, and then coming back to
this session. It will also help if you have a basic
understanding of machine learning and how it works. Okay,
let's look at what you can expect. I'll start off the
session with a brief overview of the challenges many companies
face when deploying machine learning models and maintaining
them in production. We will then briefly define MLOps, before
diving straight into some architecture diagrams.
Specifically, I have designed the architecture diagrams
according to t-shirt sizes from small to large. This will allow
you to choose your starting point based on the size and
maturity of your organisation. And finally, I'll end the
session with some advice for starting your own MLOps
journey. First, I want to quickly go over the machine
learning process to make sure that we're all on the same page.
Normally, you'd start your machine learning process because
you have a business problem which needs to be solved, and
you've determined that machine learning is the correct
solution. Then a data scientist will spend quite a bit of time
collecting the data required, integrating data from various
sources, cleaning the data, and analysing the data. Next, the
data scientists will start the process of engineering features,
training and tuning different machine learning models and
evaluating the performance of these models. And then based on
these results, the data scientist might go back to
collect more data or perform additional data cleaning steps.
But assuming that the models are performing well, they would
then go ahead and deploy the model so it can be used to
generate predictions. The final step and certainly a
crucial one is to monitor the model that is in production.
Much like a new car which depreciates in value as soon as
you drive it off the lot, a machine learning model is out of
date as soon as you've trained it. The world is constantly
changing and evolving, so the older your model gets, the worse
it gets at making predictions. By monitoring the quality of
your model, you will know when it's time to retrain or perhaps
gather new data for it. Now, every data scientist will follow
a process along these lines, and when they're just starting out,
this process will likely be entirely manual. But as
businesses embrace machine learning across their
organisations, manual workflows for building, training, and
deploying models tend to become bottlenecks to innovation. Even a single
machine learning model in production needs to be
monitored, managed and retrained to maintain the quality of its
predictions. Without the right operational practices in place,
these challenges can negatively impact data scientist
productivity, model performance and costs. So to illustrate what
I mean let's take a look at some architecture diagrams. Here I
have a pretty standard data science setup. The data
scientist has been given access to an AWS account with a SageMaker
Studio domain, where they can use Jupyter Notebooks to
develop their machine learning models. Data might be pulled
from S3, RDS, Redshift, Glue, or any number of data-related AWS
services. The models produced by the data scientist are then
stored in an S3 bucket. So far, so good. Unfortunately, this is
often where it stops. Gartner estimates that only 53% of
machine learning POCs actually make it into production. And
there are various reasons for this. Often there's a
misalignment between strategic objectives of the company and
the machine learning models being built by the data
scientists. A lack of communication between DevOps, security,
legal, IT, and the data scientists is another common challenge
blocking models from reaching production. And finally, if the company already
has a few models in production, a data scientist team can
struggle to maintain those existing models, while also
pushing out new models. But what if the model does make it
into production? Let's assume the data scientist spins up a
SageMaker endpoint to host the model, and the developer of an
application is able to connect to this model and generate
predictions through API Gateway, which connects to a Lambda
function that calls the SageMaker endpoint.
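Just to make that concrete, here's a rough sketch of what such a Lambda function might look like; the endpoint name and payload format are hypothetical, not taken from the actual demo:

```python
import json
import boto3

# SageMaker runtime client for invoking a deployed endpoint
runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name; in practice this would come from an
# environment variable set at deployment time
ENDPOINT_NAME = "churn-model-endpoint"

def lambda_handler(event, context):
    # API Gateway (proxy integration) passes the request body as a string
    payload = event["body"]

    # Forward the payload to the SageMaker endpoint for inference
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=payload,
    )

    # Return the model's prediction to API Gateway
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```

So what challenges do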
you see with this architecture? Well first, any changes to the
machine learning model require manual actions by the data
scientist in the form of re-running cells in a Jupyter
Notebook. Second, the code which the data scientist
produces is stuck in these Jupyter Notebooks, which are
difficult to version and difficult to automate. Third,
the data scientist might have forgotten to turn on
auto-scaling for the SageMaker endpoint, so it cannot adjust
capacity according to the number of requests coming in. And
finally, there's no feedback loop. If the quality of the
model deteriorates, you would only find out through complaints
from disgruntled users. These are just some of the challenges
which can be avoided with a proper MLOps setup. So what is
MLOps? Well, MLOps is a set of operational practices to
automate and standardise model building, training, deployment,
monitoring, management, and governance. It can help
companies streamline the end-to- end machine learning lifecycle,
and boost productivity of data scientists and MLOps teams,
while maintaining high model accuracy, and enhancing security
and compliance. So the key phrase in that previous
definition is operational practices. MLOps, similar to
DevOps, is more than just a set of technologies or services. You
need the right people, with the right skills, following the same
standardised processes to successfully operate machine
learning at scale. The technology exists to facilitate
these processes and make the job easier for the people. Now in
this session, I will focus on the technology and specifically
which AWS services we can use to build a successful setup. But I
want you to keep in mind that the architectures provided in
this session will only work if you have the right teams, and if
those teams are willing to establish and follow MLOps
processes. Now if you want to learn more about the people and
process aspects of MLOps, I will include links to useful
resources at the end of this session. So without further ado,
let's dive deep with some architecture diagrams. Now
MLOps can be quite complicated, with lots of features and
technologies which you could choose to adopt, but you don't
have to adopt all of it immediately. So to start off,
I'll give you an example of a minimal MLOps setup. This would
be suitable for a small company or a small data science team of
one to three people working on just a couple of use cases. So
let's take a look. We'll start off with the same architecture
we looked at previously, only reduced in size so I can create
more space on the slide. A data scientist accesses Jupyter
Notebooks through SageMaker Studio, accesses data from any
of various data sources, and stores any machine learning
models they create in S3. One of the challenges I mentioned
previously is that the code is stuck in Jupyter notebooks and
can be difficult to version and automate. So the first step
would be to add more versioning to this architecture. You can
use CodeCommit or any other Git-based repository to store code.
And you can use Amazon Elastic Container Registry or ECR to
store Docker containers, thereby versioning the environments
which were used to train the machine learning models. By
versioning the code, the environments and the model
artifacts, you improve your ability to reproduce models and
collaborate with others. Next, let's talk about automation.
Another challenge I mentioned previously is that the data
scientists are manually re-training models instead of
focussing on developing new models. To solve this, you want
to set up automatic re-training pipelines. In this architecture,
I use SageMaker Pipelines, but you could also use Step
Functions or Airflow to build these repeatable workflows. The
re-training pipeline, built by the data scientist or a machine
learning engineer, will use the versioned code and environments to
perform data pre-processing, model training, and model
verification, and eventually save the new model artifacts to
S3. It can use various services to complete these steps,
including SageMaker processing or training jobs, EMR, or
Lambda.
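To give you an idea of what this looks like in code, here's a minimal sketch of a SageMaker Pipeline with a single training step, using the SageMaker Python SDK; the role, bucket, and training script are hypothetical, and a real re-training pipeline would add processing, evaluation, and model-registration steps:

```python
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# Hypothetical execution role and session
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
session = sagemaker.Session()

# Estimator describing the training job: container, instance, and script
estimator = SKLearn(
    entry_point="train.py",            # hypothetical training script
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
    sagemaker_session=session,
)

# A single training step reading from a hypothetical S3 prefix
train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data="s3://my-ml-bucket/train/")},
)

# Assemble and register the pipeline, then start an execution
pipeline = Pipeline(name="retraining-pipeline", steps=[train_step])
pipeline.upsert(role_arn=role)
pipeline.start()
```

But in order to automate this pipeline, we need a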
trigger. One option is to use EventBridge to trigger the
pipeline based on a schedule. Another option is to have
someone manually trigger the pipeline. Both triggers are
useful in different contexts, and I'll introduce more triggers
as we progress through these slides.
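As a rough sketch, assuming hypothetical pipeline and role ARNs, both trigger options might look something like this with boto3:

```python
import boto3

events = boto3.client("events")

# Hypothetical ARNs for the pipeline and the role EventBridge assumes
pipeline_arn = "arn:aws:sagemaker:ap-southeast-2:123456789012:pipeline/retraining-pipeline"
trigger_role = "arn:aws:iam::123456789012:role/EventBridgePipelineRole"

# Option 1: a scheduled rule that kicks off re-training once a week
events.put_rule(Name="weekly-retraining", ScheduleExpression="rate(7 days)")
events.put_targets(
    Rule="weekly-retraining",
    Targets=[{
        "Id": "retraining-pipeline",
        "Arn": pipeline_arn,
        "RoleArn": trigger_role,
    }],
)

# Option 2: the manual alternative, starting an execution directly
boto3.client("sagemaker").start_pipeline_execution(
    PipelineName="retraining-pipeline"
)
```

So now that we have an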
automated re-training pipeline, I want to introduce another
important concept in MLOps, and that's the model registry. While
S3 provides some versioning and object locking functionality,
which is useful for storing different models, a model
registry helps to manage these models and their versions.
SageMaker model registry allows you to store metadata alongside your
models, including the values of hyperparameters and evaluation
metrics, or even the bias and explainability reports. This
enables you to quickly view and compare different versions of a
model and to approve or reject a model version for production.
Now the actual artifacts are still stored in S3, but the model
registry sits on top of that as an additional layer.
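To make this a bit more tangible, here's a hedged sketch of registering and approving a model version with boto3; the group name, container image, and artifact location are all hypothetical:

```python
import boto3

sm = boto3.client("sagemaker")

# One-off: a model package group for this use case (hypothetical name)
sm.create_model_package_group(ModelPackageGroupName="churn-model")

# Register a trained model artifact as a new, pending version
response = sm.create_model_package(
    ModelPackageGroupName="churn-model",
    ModelApprovalStatus="PendingManualApproval",
    InferenceSpecification={
        "Containers": [{
            "Image": "123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/churn:latest",  # hypothetical ECR image
            "ModelDataUrl": "s3://my-ml-bucket/models/churn/model.tar.gz",              # hypothetical artifact
        }],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
)

# After reviewing the metrics, approve the version for deployment
sm.update_model_package(
    ModelPackageArn=response["ModelPackageArn"],
    ModelApprovalStatus="Approved",
)
```

Finally, we reach the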
deployment stage. At first glance, this might look very
different from what we saw earlier in the session. But the
setup is actually very similar. I still have machine learning
models deployed on real-time SageMaker endpoints connected
to Lambda and API Gateway to communicate with an application.
The main difference now is that I have autoscaling set up for
my SageMaker endpoints. So if there's an unexpected spike in
users, the endpoints can scale up to handle the requests and
scale back down when the usage falls.
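For reference, configuring this with Application Auto Scaling might look roughly like the following; the endpoint name, variant name, and capacity limits are just example values:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and production variant names
resource_id = "endpoint/churn-model-endpoint/variant/AllTraffic"

# Register the endpoint variant as a scalable target (1 to 4 instances)
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale based on how many invocations each instance is handling
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```

Now one nice feature of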
SageMaker endpoints is that you can replace the machine learning
model without endpoint downtime. Since I now have an automated
re-training pipeline creating new models, and a model registry
where I can approve models, it would be best if the deployment
of the new models is automated as well. I can achieve this by
building a Lambda function which, when a new model is approved,
fetches that model and updates the endpoint with it.
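Here's a simplified sketch of what that Lambda function could look like, assuming it's triggered by the model package state change event; the endpoint name, role, and naming convention are hypothetical:

```python
import time
import boto3

sm = boto3.client("sagemaker")

ENDPOINT_NAME = "churn-model-endpoint"                              # hypothetical
ROLE_ARN = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical

def lambda_handler(event, context):
    # EventBridge delivers a "SageMaker Model Package State Change" event
    detail = event["detail"]
    if detail.get("ModelApprovalStatus") != "Approved":
        return {"status": "ignored"}  # only act on approvals

    model_package_arn = detail["ModelPackageArn"]
    suffix = str(int(time.time()))

    # Create a SageMaker model from the approved model package
    model_name = f"churn-model-{suffix}"
    sm.create_model(
        ModelName=model_name,
        ExecutionRoleArn=ROLE_ARN,
        Containers=[{"ModelPackageName": model_package_arn}],
    )

    # New endpoint config pointing at the new model
    config_name = f"churn-config-{suffix}"
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }],
    )

    # Swap the endpoint over to the new config without downtime
    sm.update_endpoint(EndpointName=ENDPOINT_NAME, EndpointConfigName=config_name)
    return {"status": "updating", "endpoint_config": config_name}
```

So now we have connected all the pieces and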
there's one final feature that I will take advantage of. Not only
can I update the machine learning models hosted by the
endpoints, but I can actually do so gradually using a canary
deployment. This means that a small portion of the user
requests will be diverted to the new model, and any errors or
issues will trigger a CloudWatch alarm to inform me. Over
time, the number of requests sent to the new model will
increase until the new model gets 100% of the traffic.
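A sketch of such a canary update, using SageMaker's deployment guardrails with hypothetical endpoint, config, and alarm names, might look like this:

```python
import boto3

sm = boto3.client("sagemaker")

# Shift traffic to the new endpoint config gradually; roll back
# automatically if the named CloudWatch alarm fires
sm.update_endpoint(
    EndpointName="churn-model-endpoint",
    EndpointConfigName="churn-config-new",
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 600,
            },
            "TerminationWaitInSeconds": 300,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "churn-endpoint-5xx-errors"}]
        },
    },
)
```

So I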
hope this architecture makes sense. I started with a very
basic setup, and by adding a few features and services, I now
have a serviceable MLOps setup. My deployment strategy is
more robust by using auto-scaling and canary deployment,
my data scientists save time by automating model training, and
every artefact is properly versioned. But as your data
science team grows, this architecture won't be
sufficient. So let's look at a slightly more complicated
architecture. The next architecture will be more
suitable for a growing data science team of between three and
10 data scientists working on several different use cases at a
larger company. So again, let's start with the basics. Our data
scientists work in notebooks through SageMaker Studio,
pulling from various data sources and versioning their
code environments and model artifacts. This should look
familiar. I'll also bring back the automated re-training pipeline.
Nothing has changed here, I've only made it smaller to create
more room on the slide. And finally, I'll bring back
EventBridge to schedule the re-training pipeline, and the model
registry for storing model metadata and approving model
versions. All of this is exactly the same as in the previous
architecture diagram. So what about deployment? Well, this is
where things change a little. So I have the same deployment setup
with SageMaker endpoints and an autoscaling group connected to
Lambda and API Gateway to allow users to submit inference
requests. However, these deployment services now sit in a
separate AWS account. A multi-account strategy is highly
recommended, because this allows you to separate different
business units, easily define separate restrictions for
important production workloads, and have a fine-grained view of
the costs incurred by each component of your architecture.
The different accounts are best managed through AWS
Organizations. Now, data scientists should not have
access to the production account. This reduces the chance
of mistakes being made on that account, which would directly
affect your users. In fact, a multi-account strategy for machine
learning usually has a separate staging account alongside the
production account. Any new models are first deployed to the
staging account, tested, and only then deployed to the production
account. So if the data scientist cannot access these
accounts, clearly, the deployment must happen
automatically. All of the services deployed into the
staging and production accounts are set up automatically using
CloudFormation, controlled by CodePipeline in the development
account. The next step is to set up a trigger for CodePipeline.
And we can do so using EventBridge. When a model version
is approved in the model registry, this generates an event
which can be used to trigger deployment via CodePipeline.
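As a rough illustration, the EventBridge rule and target for this could look something like the following; the model package group, pipeline ARN, and role are hypothetical:

```python
import json
import boto3

events = boto3.client("events")

# Fire whenever a model package in this group is approved
pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {
        "ModelPackageGroupName": ["churn-model"],   # hypothetical group
        "ModelApprovalStatus": ["Approved"],
    },
}

events.put_rule(Name="model-approved", EventPattern=json.dumps(pattern))

# Target the deployment pipeline that sets up staging and production
events.put_targets(
    Rule="model-approved",
    Targets=[{
        "Id": "deploy-pipeline",
        "Arn": "arn:aws:codepipeline:ap-southeast-2:123456789012:model-deploy",   # hypothetical
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeCodePipelineRole",  # hypothetical
    }],
)
```

So now everything's connected again, and this is starting to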
look like a proper MLOps setup. But I'm sure you've
noticed I have plenty of space left on this slide. So let's add
another feature which becomes crucial when you have multiple
models running in production for extended periods of time:
SageMaker Model Monitor. The goal of monitoring machine learning
models in production is to detect a change in behaviour or
accuracy. To start, I enable data capture on the endpoints in
the staging and production accounts. This captures the
incoming requests and outgoing inference results and stores
them in S3 buckets.
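A minimal sketch of enabling data capture when creating the endpoint configuration, with hypothetical names and S3 locations, might look like this:

```python
import boto3

sm = boto3.client("sagemaker")

# Endpoint config with data capture turned on
sm.create_endpoint_config(
    EndpointConfigName="churn-config-with-capture",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "churn-model-latest",   # hypothetical model
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
    DataCaptureConfig={
        "EnableCapture": True,
        "InitialSamplingPercentage": 100,
        "DestinationS3Uri": "s3://my-capture-bucket/churn-endpoint/",  # hypothetical bucket
        "CaptureOptions": [
            {"CaptureMode": "Input"},   # incoming requests
            {"CaptureMode": "Output"},  # outgoing predictions
        ],
    },
)
```

If you have a model monitoring use case,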
which doesn't require labelling the incoming requests, then you
could run the whole process directly on your staging and
production accounts. But in this case, I assume the data needs to
be combined with labels or other data that's on the development
account. So I use S3 replication to move the data onto an S3
bucket in the development account. Now, in order to tell if the
behaviour of the model or the data has changed, we need
something to compare it to. That's where the model baseline
comes in. During the training process as part of the automated
re-training pipeline, we can generate a baseline dataset,
which records the expected behaviour of the data and the
model. So that gives me all the components I need to set up
SageMaker Model Monitor, which will compare the two datasets and
generate a report. The final step in this architecture is to
take action based on the results of the model monitoring report.
And we can do this by sending an event to EventBridge to trigger
the re-training pipeline when a significant change has been
detected.
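Pulling the baseline and monitoring pieces together in code, a hedged sketch with the SageMaker Python SDK might look like the following; the role, S3 paths, endpoint name, and schedule are example values:

```python
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Hypothetical execution role
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Baseline from the training dataset: expected statistics and constraints
monitor.suggest_baseline(
    baseline_dataset="s3://my-ml-bucket/baseline/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-ml-bucket/baseline/results/",
)

# Hourly comparison of captured endpoint traffic against the baseline
monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-data-quality",
    endpoint_input="churn-model-endpoint",
    output_s3_uri="s3://my-ml-bucket/monitoring/reports/",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```

And that's it for the medium MLOps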
architecture! It contains a lot of the same features used in the
small architecture, but it expands to a multi-account
setup, and adds model monitoring for extra quality checks on the
models in production. Hopefully, you're now wondering what a
large MLOps architecture looks like and how I can possibly fit
more features onto a single slide. So let's take a look at
that now. This architecture is suitable for companies with
large data science teams of 10 or more people and with machine
learning integrated throughout the business. Of course, I start
with the same basic setup I had last time but reduced in size
again. The data scientist is still using SageMaker Studio
through a development account, and stores model artifacts and
code in S3 and CodeCommit respectively. The data sources
are also present, but data is now stored in a separate
account. It's a common strategy to have your data lakes set up
in one account with fine-grained access controls to determine
which datasets can be accessed by resources in other accounts.
Really, the larger a company becomes, the more AWS accounts
they tend to use, all managed through AWS Organizations. So
let's continue this trend by bringing back the automated
re-training pipeline in a separate operations account. And
let's bring back model registry as well in yet another account.
All of the components are the same as in the small and medium
architecture diagrams, but just split across more accounts. The
operations account is normally used for any automated workflows
which don't require manual intervention by the data
scientists. It's also good practice to store all of your
artifacts in a separate artefact account like I have here for
model registry. Again, this is an easy way to prevent data
scientists from accidentally changing production artifacts.
Next, let's bring back the production and staging accounts
with the deployment setup. This is exactly the same as in the
previous architecture, just reduced in size. The
infrastructure in the production and staging accounts is still
set up automatically through CloudFormation and CodePipeline,
but CodePipeline now sits in a CI/CD account. Note that I have
built this diagram based on account structures I have seen
organisations use, but your organisation might use a
different account setup, and that's totally fine. Use this
diagram as an example and adjust it to your structure and your
needs. Now, let's connect our model registry to CodePipeline
by using EventBridge exactly the same as in the previous
architecture. And now we have all the pieces connected again.
Now, I don't know if you noticed, but one of the basic building
blocks is still missing in this picture. Hopefully you spotted
it - ECR disappeared for a little while. So let's bring it back by
placing it in the artefact account because environments,
especially production environments, are artifacts which
need to be protected. There's one more change I want to make
to my use of ECR here. In the previous architecture diagrams,
I assumed that data scientists were building Docker containers
and registering these containers in ECR manually. This process
can be simplified and indeed automated using CodeBuild and
CodePipeline. The data scientist or machine learning
engineer can still write the Dockerfile, but the building
and registration of the container is performed automatically.
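As an example of what that automation could look like, here's a sketch of a CodeBuild project created with boto3; the repository, buildspec, and role are hypothetical:

```python
import boto3

codebuild = boto3.client("codebuild")

# Project that builds the Dockerfile in the repo and pushes it to ECR,
# according to the commands in the (hypothetical) buildspec.yml
codebuild.create_project(
    name="build-training-container",
    source={
        "type": "CODECOMMIT",
        "location": "https://git-codecommit.ap-southeast-2.amazonaws.com/v1/repos/ml-training",
        "buildspec": "buildspec.yml",
    },
    artifacts={"type": "NO_ARTIFACTS"},
    environment={
        "type": "LINUX_CONTAINER",
        "image": "aws/codebuild/standard:7.0",
        "computeType": "BUILD_GENERAL1_SMALL",
        "privilegedMode": True,  # required to run Docker builds
    },
    serviceRole="arn:aws:iam::123456789012:role/CodeBuildServiceRole",  # hypothetical
)
```

This saves even more time, so data scientists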
can focus on what they do best. Of course, in the previous
architecture, I used model monitor to trigger model
re-training if significant changes in model behaviour were
detected. So let's bring that back as well, starting with the
data capture in the staging and production accounts, followed by
data replication into the operations account. As before,
model monitor will need a baseline to compare performance
and the generation of this baseline can be a step in
SageMaker Pipelines. Finally, I'll bring back model monitor to
generate reports on drift and trigger re-training if necessary.
This leaves us with all of the components I had in the medium
MLOps diagram. But there are two more features that I want to
introduce. The first is SageMaker Feature Store, which sits
in the artefact account, because features are artifacts which can
be reused. If you remember the basic data science workflow from
the beginning of the session, data scientists will normally
perform feature engineering before training a model, and it
has a large impact on model performance. In large companies, there's a good chance that data
scientists will be working on separate use cases which rely on
the same dataset. A feature store allows data scientists to
take advantage of features created by others. It reduces
their workload and also ensures consistency in the features that
are created from a dataset.
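To give a flavour of how that works, here's a rough sketch of creating and populating a feature group with the SageMaker Python SDK; the feature names, S3 location, and role are hypothetical:

```python
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical

# Example features engineered from a shared customer dataset
df = pd.DataFrame({
    "customer_id": ["C1", "C2"],
    "tenure_months": [12, 34],
    "avg_monthly_spend": [55.0, 82.5],
})
df["customer_id"] = df["customer_id"].astype("string")
df["event_time"] = time.time()

feature_group = FeatureGroup(name="customer-features", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=df)

# Create the group once; other data scientists can then reuse these features
feature_group.create(
    s3_uri="s3://my-ml-bucket/feature-store/",   # offline store location
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
)

# Wait until the feature group is active before ingesting
while feature_group.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)

# Write the engineered feature values into the store
feature_group.ingest(data_frame=df, max_workers=1, wait=True)
```

The final component I want to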
introduce is SageMaker Clarify. Clarify can be used by data
scientists in the development phase to identify bias in
datasets and to generate explainability reports for
models. This technology is important for responsible AI. Now
similar to model monitor, Clarify can also be used to generate
baseline bias and explainability reports, which can then be
compared to the behaviour of the model in the endpoint. If
Clarify finds that bias is increasing or the explainability
results are changing, it can trigger a re-training of the
model.
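As a small, hedged example, running a pre-training bias check with Clarify might look roughly like this; the dataset, columns, and facet are hypothetical:

```python
from sagemaker import clarify

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical

processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Where the training data lives and where to write the bias report
data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-ml-bucket/train/train.csv",
    s3_output_path="s3://my-ml-bucket/clarify/bias-report/",
    label="churned",                               # hypothetical label column
    headers=["age", "tenure_months", "churned"],   # hypothetical columns
    dataset_type="text/csv",
)

# Check whether outcomes differ across a sensitive attribute
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="age",
    facet_values_or_threshold=[40],
)

# Pre-training bias metrics on the dataset itself
processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
)
```

Now both Feature Store and Clarify can be introduced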
much earlier in the medium or even the small MLOps
architectures. It really depends on the needs of your business.
And I hope you can use these example architectures to design
an architecture which works for you. Now, the architecture
diagrams in this session rely heavily on different components
offered by Amazon SageMaker. SageMaker provides
purpose-built tools and built-in integrations with other AWS
services so you can adopt MLOps practices across your
organisation. Using Amazon SageMaker, you can build CI/CD
pipelines to reduce model management overhead, automate
machine learning workflows to accelerate data preparation,
model building and model training, monitor the quality of
models by automatically detecting bias, model drift, and
concept drift, and automatically track lineage of code, datasets,
and model artifacts for governance. But if there's one
thing I want you to take away from this session, it should be
this: MLOps is a journey; you don't have to immediately adopt
every feature available in a complicated architecture design.
Start with the basic steps to integrate versioning and
automation. Evaluate all the features I introduced in this
session, and order them according to the needs of your
business, then start adopting them as and when they're needed.
The architecture diagrams I presented in this session are not
the only way to implement MLOps, but I hope they'll provide
some inspiration to you as an architect. So to help you get
started, I've collected some useful resources and placed the
links on this slide. You should be able to download a copy of
these slides so you can access these links. The resources on
this page will not only provide advice on the technology behind
MLOps but also on the people and processes which we
discussed briefly at the start. If you're interested in any
other topic related to the AWS Cloud, I recommend checking out
the Skill Builder online learning centre. It offers over 500 free
digital courses for any level of experience. You should also
consider getting certified through AWS Certifications. If
you enjoyed the topic of this particular session, I'd
recommend checking out the Solutions Architect Associate
and Professional certifications, as well as the Machine Learning
Specialty certification. And that's all I have for today.
Thanks for taking the time to listen to me talk about MLOps,
and I hope this content helps you in upcoming projects. I just
have one final request and that's to complete the session
survey. It's the only way for me to know if you enjoyed this
session and it only takes a minute. I hope you have a
wonderful day and enjoy the rest of Summit!