AWS re:Invent 2022 - Productionize ML workloads using Amazon SageMaker MLOps, feat. NatWest (AIM321)

Captions
- Welcome, everybody, to our session on Productionizing ML workloads using Amazon SageMaker MLOps services. I'm Usman Anwer, principal product manager for MLOps services on SageMaker, here with two of my amazing colleagues. - Shelbee Eigenbrode. I am an ML specialist solutions architect, and I specialize in the area of MLOps. - And I'm Greig Cowan. I'm head of data science at NatWest Group. Thanks. - Awesome. Thanks, guys.

Alrighty. So we have a pretty packed agenda for you today, so we'll dive right in. We'll talk about the what and the why of MLOps, going over some of the use cases that are a priority for our customers. We are then gonna get into what SageMaker offers to support those use cases, and the various improvements we have made over the last year. As we go through, I'll also mention some of the feedback we have gotten and how we're responding to it. Then Shelbee's gonna give us a demo on how to use these services, end-to-end, to achieve a use case, and talk about how you can approach scaling MLOps within your organization. And then, finally, Greig is gonna tell us about his journey on scaling MLOps at NatWest.

So MLOps has many different meanings. Broadly speaking, we refer to it as the process of continuously delivering high-performance ML models at scale. It consists of multiple activities that take place across the entire machine learning life cycle. So these are activities concerning development of machine learning models, activities concerning productionization of machine learning models, and then activities concerning maintenance of an ML system in production on an ongoing basis. So this is a pretty broad area. And what we've learned is that MLOps is not just one capability that you can just enable. It's a living and breathing discipline that continuously evolves. It's practiced by different personas who work independently and together across the machine learning life cycle. They do so by following specific processes and mechanisms and best practices, and they need purpose-built tools in order to do that. So our goal, on SageMaker, is to offer these tools out of the box. And the way we measure our success is how these tools help our customers be more agile, deliver more quality models at greater scale and remain cost effective as they scale.

Now, before I get into the nitty-gritty of our tools, I wanna talk about the use cases that we see in the wild. Broadly speaking, we can classify use cases into the development side of the life cycle and the production or deployment side of the life cycle. And this is just one mental model. Different customers have different approaches. This is just one we'll discuss today. On the development side, we see a lot of customers interested in making it easy to provision environments for their data scientists, to help them get started quickly. We see them standardizing how they perform experiments. On the deployment side, we see ML engineers developing pipelines that can help retrain the models in production. We see them packaging and testing the models, and then, when they deploy the models, they want to be able to monitor them on a continuous basis. And then a lot of our customers want to close the loop. So take the results from the monitoring system, and use that to retrain the models on an ongoing basis. And also track end-to-end lineage of the model so they can reproduce it to debug the models that they have in production.
SageMaker offers the broadest set of services to support MLOps natively. We offer SageMaker Projects to help you create templates that can help data scientists provision the resources they need to get started. We offer SageMaker Experiments to help you centrally track experiments performed on SageMaker. We offer Model Registry, which is a centralized catalog for all of your models that you can use to version control the models, review models for production, track their lineage and configure them for deployment. We offer various deployment mechanisms so you can shadow test and, very recently, A/B test your models. We also offer integrations with your CI/CD systems. So once you want to put a model into production, your existing CI/CD systems, such as Jenkins pipelines, etc., can go launch the model on SageMaker Inference. We offer Model Monitoring so you can continuously monitor the health of your model. And, finally, we offer SageMaker Pipelines, which is a built-in workflow orchestration service to help you retrain the model on an ongoing basis.

Now, we've been on this journey for a couple of years. We have built these tools out over a number of years, and they've been adopted by several customers who are now starting to report a lot of great impact that they're seeing. We see customers tell us that their time to market, from initial idea to first production version of the model, has decreased by four times. They see 85% reusability of artifacts across multiple teams and multiple use cases, and, as a result of that, they report a reduction of overhead for their machine learning engineers and data scientists, which just makes their job more exciting because instead of doing ops, they can do more data science and more testing.

An important use case for our customers, and, oftentimes, this is where they start, is standardizing the resource creation process through templates. Customers really like to create templates that they can expose to their data scientists via SageMaker Projects inside of Studio. When a template is executed, it can go create GitHub repos in the background. It can create pipelines, sample notebooks, a model registry, etc. Everything that a data scientist needs to get started. And guess what, they don't even know the complexity that's taking place in the background. Customers tell us this massively reduces overhead, both for their admins in charge of onboarding new data science projects, and for the data scientists themselves. We have a growing list of templates available in our library on GitHub, which can help you provision resources for a whole bunch of different use cases, from provisioning GitHub repos using Terraform, to provisioning supporting AWS infrastructure such as encrypted buckets, or resources for multi-account workflows. And there are more coming.

Once the data scientist has their resources, they start to do iterative experimentation. With SageMaker Experiments, you can automatically log experiment metrics and artifacts on SageMaker.
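As a rough illustration of that kind of centralized tracking, here is a minimal sketch using the SageMaker Experiments Run API from recent versions of the SageMaker Python SDK; the experiment, run, parameter and metric names are placeholders, not the ones used in this talk:

```python
# Minimal experiment-tracking sketch (placeholder names, assumes AWS credentials
# and a recent SageMaker Python SDK with the Run API).
from sagemaker.experiments.run import Run
from sagemaker.session import Session

with Run(
    experiment_name="churn-prediction",       # placeholder experiment name
    run_name="xgboost-baseline-1",            # placeholder run name
    sagemaker_session=Session(),
) as run:
    # Log the hyperparameters used for this run.
    run.log_parameters({"max_depth": 5, "eta": 0.2, "num_round": 100})

    # ... train and evaluate the model here ...

    # Log an evaluation metric so runs can be compared and visualized in Studio.
    run.log_metric(name="validation:accuracy", value=0.91)
```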
Now, customers have given us a lot of feedback here. They don't only want to track experiments conducted on the SageMaker training platform, they also want to track experiments conducted in local notebooks, or in scripts that might be running on on-premises servers. They want us to simplify the concepts in the service so they're more digestible by data scientists. They want us to make it easy to visualize and compare and share these experiments. They want better integration with the hyperparameter optimization offerings that we have. They want visualizations specific to their use case, such as the parallel coordinate chart that can help you see the combination of parameters that reduces the loss for your model. And, finally, they want a simplified handoff between the data scientist and the MLE by way of the model registry. So a general piece of feedback we have gotten is that we need to tightly integrate all of these services so the end-to-end workflow is smoother. And this is an area where you can expect a lot of updates from us.

Now, once the model has been developed, oftentimes, the MLE starts to create a pipeline that they can use to retrain that model against production data, and on an ongoing basis. We see a lot of momentum with SageMaker Pipelines amongst our enterprise customers. SageMaker Pipelines can automate the entire model building process, from data processing to feature extraction to model training to model validation and model registration. Customers tell us the leading reason for them to standardize on SageMaker Pipelines is the built-in fault tolerance. So we have made additional investments in our error handling strategy, our caching and resilience, which all add up to that. And, also, they notice that SageMaker Pipelines is completely serverless, so you don't have to worry about maintaining a pipeline product yourself. And, finally, it's free, which always helps.

We have made four significant improvements to SageMaker Pipelines over the last few months, specifically to help customers iterate faster and minimize their costs. First, customers told us that it is a bit of a pain to test the pipeline, end-to-end, when they're first getting started. If you test the pipeline in the cloud, it actually goes and creates all the SageMaker jobs, which can incur some time and cost. So we launched SageMaker Pipelines Local Mode, which allows you to run the entire pipeline on your local machine. In Local Mode, the pipeline creates local jobs; we support all the main job types, such as data processing, training of the model, and also batch transforms. Once you are satisfied with the inputs and outputs of your pipelines, you can make a few tweaks to them, such as adding references to production data or adding a model registration step that applies to workflows that operate in the cloud. You can then upload this pipeline, and when you run it, it will actually execute all the jobs in the cloud, and you also get the advantage of all the other tools in the cloud, such as being able to track the execution of the pipeline in SageMaker Experiments.
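For reference, a minimal sketch of what that Local Mode workflow can look like with the SageMaker Python SDK's LocalPipelineSession; the role ARN, script name and S3 paths are placeholders, and the exact behavior should be checked against the SDK documentation:

```python
# Local Mode sketch: the same pipeline definition, but every step runs as a
# container on the local machine instead of as a SageMaker job.
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.workflow.steps import ProcessingStep

local_session = LocalPipelineSession()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"   # placeholder role

processor = SKLearnProcessor(
    framework_version="1.0-1",
    instance_type="ml.m5.xlarge",   # with a local session this still runs as a local container
    instance_count=1,
    role=role,
    sagemaker_session=local_session,
)

preprocess_step = ProcessingStep(
    name="PreprocessChurnData",
    step_args=processor.run(
        code="preprocess.py",                                    # placeholder local script
        inputs=[ProcessingInput(
            source="s3://example-bucket/churn/raw/",             # placeholder input location
            destination="/opt/ml/processing/input",
        )],
        outputs=[ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/train",
        )],
    ),
)

pipeline = Pipeline(
    name="churn-build-local",
    steps=[preprocess_step],
    sagemaker_session=local_session,
)
pipeline.create(role_arn=role)   # register the definition with the local session
execution = pipeline.start()     # run every step as a local container

# Moving to the cloud is then mostly a matter of swapping LocalPipelineSession
# for PipelineSession and pointing the steps at production data.
```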
Next, we're seeing a surge in the use of AutoML, by a lot of customers, to accelerate their machine learning. SageMaker offers Autopilot, which uses best-in-class AutoML frameworks, such as AutoGluon, to automatically train multiple models for your use case to find the best one. Now, using AutoML in production was a complicated task. Customers had to create custom steps in their pipelines, almost four steps comprising Lambda functions and callbacks; they would have to start the Autopilot job, monitor it until it finishes, grab the artifact, and send it down the pipeline. So we wanted to simplify all of this, and today we launched the AutoML training step in Pipelines, which boils that down to a few lines of code. So we are very excited to see what customers do with it. It's one of the most highly requested features for AutoML, and it helps take AutoML to production.

Next, customers tell us, "Well, all of this automation is great," but they've gotta be able to adapt it to their very complex enterprise environments. A lot of our customers do machine learning across multiple AWS accounts. So they might have one account where the data scientists develop the models, another where the model is tested or retrained for production and hosted in production. Now, if you want to automate your machine learning life cycle, you gotta automate across these accounts. So we launched support for cross-account sharing of pipelines, which lets customers, without having to log into a different AWS account, view all the pipelines that are available and execute them remotely. So you can think about a data scientist who does not have access to production data, or the production account. They can simply go discover the pipelines. They can put their model code in a GitHub repo. The pipeline, when it executes in the production account, can take that code, retrain the model, do some testing, and send the results back to the data scientist. Again, enterprise customers are super excited about this; it helps them take MLOps to the next level.

Finally, one of the resounding pieces of feedback we've gotten, not only for our UI but also for our SDK, is simplification. So we have shipped more than half a dozen simplifications to the Python SDK. Here's just one example where we took two separate steps to register a model and create model artifacts, and boiled it down to one step, so now it's just easy to add that final registration and creation step into your pipelines.

Now, earlier I had mentioned how a lot of customers want to use model monitoring to automate the retraining of the model. SageMaker Model Monitor offers built-in tools to visualize your model's performance and get reports on it on an ongoing basis. It supports monitoring for data quality, model quality, model bias and model explainability. A lot of our customers, over the last few years, have use cases where they're using batch inference in mission-critical scenarios to get inferences on a bunch of transactions. You can imagine a bank that wants to track down money laundering, for instance. They want to be able to consolidate transactions from multiple systems, and then run a batch inference every 24 hours to find patterns that might indicate money laundering. Now, from that example, you can infer, no pun intended there, that this would be a mission-critical model. If the performance starts to diverge, it can create more noise, it can create more work for you than it solves, so we launched Model Monitor support for batch inference. It's super simple to set up. You may already have batch transform jobs set up. You can go and create a batch monitoring job and set it on a specific schedule; as part of the setup process, you can create a baseline using a training data set, and then put it into production. Based on the schedule, the job executes, compares the outputs against any ground truth that you might have, or against the baseline, and reports all of the violations in a report you can read through the user interface, but it also outputs them so they can be programmatically queried. So, yeah. And now I'm gonna invite Shelbee back on stage so she can show you how some of these improvements can be used together to enable a retraining use case. Thanks.
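As an example of the kind of SDK simplification mentioned above, here is a hedged sketch of registering a model in a single pipeline step with ModelStep and model.register(); the container image, model artifact location, instance types and model package group name are placeholders:

```python
# Single-step model create + register sketch (placeholder image, artifact and names).
import sagemaker
from sagemaker.model import Model
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"   # placeholder role

model = Model(
    image_uri=sagemaker.image_uris.retrieve("xgboost", region="us-east-1", version="1.5-1"),
    model_data="s3://example-bucket/churn/model/model.tar.gz",   # typically train_step.properties.ModelArtifacts.S3ModelArtifacts
    role=role,
    sagemaker_session=pipeline_session,
)

# One ModelStep now covers both creating the model and registering a new
# version into the model package group.
register_step = ModelStep(
    name="RegisterChurnModel",
    step_args=model.register(
        content_types=["text/csv"],
        response_types=["text/csv"],
        inference_instances=["ml.m5.xlarge"],
        transform_instances=["ml.m5.xlarge"],
        model_package_group_name="churn-prediction-models",     # placeholder group name
        approval_status="PendingManualApproval",
    ),
)
```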
- All right. Thank you, Usman, and thank you, everyone, for coming out to the far end of Mandalay Bay for this session. Sorry, just logging into the computer. I'm gonna swap over.

All right. So, in this demo, I'm gonna show you how to create SageMaker pipelines that automate the tasks that are required to monitor data drift, essentially, for your batch use cases. And there's a lot of different scenarios for batch inference. One, maybe you are actually retraining your model at the same time, or at the same frequency, that you're performing batch inference. On the other hand, and this is a use case we see quite a bit, and the one that I'm gonna focus on in the demo today, is the one where you're retraining your model less frequently than you're actually performing batch inference. So, for our use case, or the demo today, we're gonna assume that we are retraining our model once a month, based on new data or maybe signals of drift. And then we're actually gonna be performing batch inference daily.

So in this particular use case, we're really looking at two different pipelines. In our first pipeline, we're going to use that pipeline to train and baseline the model. So we're gonna do the typical steps. Here we're gonna do the good old customer churn example, where we're predicting whether or not a customer will churn. So just a simple model. But we're gonna perform the standard model build of preparing data, training the model, and then doing model evaluation. And then if that model is performing according to the objective metric that we've identified, in this case it's gonna be accuracy, then we're gonna go ahead and baseline that model. So in this particular demo, we're gonna focus on data quality model monitoring. So what we're gonna do here is we'll baseline the training data. So we'll perform some statistical analysis of that training data that we'll then use in the second pipeline to compare against and be able to indicate signals of data drift. Then we'll package our model for deployment and register our model. The second pipeline that we'll go through is for batch inference and model monitoring. So this will be that daily pipeline that runs, that performs your batch scoring.

So that being said, let's go ahead and get started. One thing to note, I'm gonna skip some of the model build steps, one, for the sake of time, but also because we have a lot of assets out there that clearly go through those. A lot of different examples. And don't worry, you'll have access to this particular set of notebooks after the session so you can see the full code and dive in. So we'll go through some of the initial setup that some of you may already be used to. We're gonna import some SageMaker libraries to configure the jobs that we'll run as part of our steps in our pipeline, and we're also going to import some of the libraries that allow us to specifically configure our SageMaker Pipeline steps. Then, if we scroll down a little bit, you'll see we're just setting up variables that have S3 paths to the inputs, outputs and artifacts that we'll create within our pipeline. And here we're gonna set the model registry name. So we're gonna identify the model package group that we're gonna register this model version to for this particular use case. So each new model version will register to this model package group. We're also going to enable step caching. So we're gonna set up our step caching configuration.
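A minimal sketch of that kind of setup, assuming the SageMaker Python SDK; the bucket prefix, model package group name and cache expiry are placeholders rather than the values used in the demo notebooks:

```python
# Setup sketch: session, role, S3 locations, model package group, step caching.
import sagemaker
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import CacheConfig

pipeline_session = PipelineSession()
role = sagemaker.get_execution_role()              # assumes a SageMaker Studio/notebook environment
default_bucket = pipeline_session.default_bucket()

# S3 locations for pipeline inputs, outputs and artifacts (placeholder prefixes)
prefix = "churn-mlops-demo"
raw_data_s3 = f"s3://{default_bucket}/{prefix}/data/raw"
train_data_s3 = f"s3://{default_bucket}/{prefix}/data/train"
baseline_results_s3 = f"s3://{default_bucket}/{prefix}/monitoring/baseline"

# Model package group that each new model version is registered into
model_package_group_name = "churn-prediction-models"

# Step caching: a step whose inputs have not changed reuses its previous result
cache_config = CacheConfig(enable_caching=True, expire_after="P30D")
```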
So one of the native capabilities with Pipelines is the ability to cache steps, which is really helpful 'cause Pipelines is gonna automatically look at that step, see if that step's been run before with the same input parameters, and if it has, it's gonna automatically propagate those results to the next step without having to recompute that task. So it's really helpful. Say you're tuning hyperparameters, and you don't necessarily need to change your data transformations or that data preparation step, but you want to continue to iterate and tune your hyperparameters. With Pipelines caching, it's going to automatically detect that that step was run, so you don't have to rerun it, which not only saves time, but cost as well.

So here we're also going to set up a runtime parameter. So Pipelines also allows for parameterization. What parameterization allows for is the ability to pass in a parameter at runtime without having to change your pipeline code. So you'll see here, we're gonna set up the parameter for the raw input data. So this is the batch prediction request data coming in. And since that is gonna change every time we retrain the model, this is a great example of something to pull out for parameterization, 'cause that way we can pass in a new value for each pipeline execution without having to change our pipeline code.

So let's jump into the configuration. Again, I'm not gonna spend too much time on these steps 'cause we have a lot of examples out there, and we don't have enough time, honestly, but here we're just, essentially, doing the normal tasks of preparing data using processing jobs. We're training the model using training jobs, and we're then evaluating that model using processing jobs. But I am gonna skip forward into the steps that are specific to model monitoring. So with model monitoring, like I mentioned, in this particular example, we're showing you the data quality model monitor. So we're gonna configure a step. Again, it's a built-in step. It's the quality check step. So what that's gonna do is, essentially, spin up a job within your pipeline. And this is using a SageMaker-managed container. You don't have to create it or manage it; it spins up that job, and it's a managed image, the SageMaker Model Monitor analyzer. And what that does is, essentially, take the input data that you pass it, in this case, our training data, perform some statistical analysis on that data, and then generate baseline constraints and statistics. And these are important 'cause these are what's gonna be used for that daily inference, to compare against and detect signals of data drift.

So to configure this, similar to most pipeline steps, you configure the job and then you configure the step. So here we're actually configuring the config for the check job. You'll see we're just specifying the compute environment for that processing job. We're also indicating the input data set that we're gonna baseline, which is our training data set from the previous data preparation step. Then we're also specifying the output. This is where we want those "statistics" and "constraints" files to go. Then we configure the step. So here we're configuring the quality check step using that previous configuration. Then we'll move on and configure the steps to package and register the model.
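Roughly, the runtime parameter and the data quality baseline step can be sketched as follows, reusing the placeholder names from the setup sketch above; the dataset path, instance type and step options are illustrative, not the exact values from the demo notebooks:

```python
# Runtime parameter + data quality baselining step sketch.
from sagemaker.model_monitor.dataset_format import DatasetFormat
from sagemaker.workflow.check_job_config import CheckJobConfig
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.quality_check_step import DataQualityCheckConfig, QualityCheckStep

# New input data can be passed per execution without changing the pipeline code.
input_data_param = ParameterString(name="InputDataUrl", default_value=raw_data_s3)

# Compute settings for the managed Model Monitor analyzer container.
check_job_config = CheckJobConfig(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=pipeline_session,
)

# Baseline the training data: statistics and suggested constraints that the
# daily batch pipeline will later compare against.
data_quality_config = DataQualityCheckConfig(
    baseline_dataset=f"{train_data_s3}/train.csv",   # or the data prep step's output property
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_results_s3,
)

baseline_step = QualityCheckStep(
    name="DataQualityBaseline",
    check_job_config=check_job_config,
    quality_check_config=data_quality_config,
    skip_check=True,                # baselining run: register a new baseline rather than checking one
    register_new_baseline=True,
    model_package_group_name=model_package_group_name,
    cache_config=cache_config,
)
```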
And this is, basically, just packaging it to create a model for batch transform in our second pipeline, and then, of course, registering that model version that we'll use in our second pipeline to get the latest approved model. And, again, I'm skipping those steps as well 'cause if we went through it all, it would take forever.

But next, we're gonna configure that conditional step. And this is also a pretty standard step. It's a built-in step that allows you to apply conditional logic. So, in this case, we're going to check the condition of whether our model is actually performing according to the minimum threshold that we've identified, in this case, accuracy. So to set this up, we basically just define the condition, in this case greater than or equal to, and the value. You can see this is a really low value. You might wanna up the game on your own. And then we configure the step. So this is the conditional step, which says, "If it is above that minimum threshold that we've identified, go ahead and execute the remaining steps in the pipeline."

So now that we've configured all the steps, we basically want to put all those steps together into a pipeline. So to do that, we define the pipeline, we create the pipeline, and then we start the pipeline. So this is showing where we're creating the pipeline. We list all the steps that we previously configured that we want to run as part of our pipeline. SageMaker is gonna automatically infer order based on dependencies, so you don't have to specify order. We're also gonna specify any parameters. These are the runtime parameters that we're passing in with each pipeline run. Then we're going to create the pipeline, which is an upsert. We'll either create or update, depending on whether a pipeline already exists with that name. And then we're gonna start the pipeline. And this is where you would specify any of your runtime parameters. In this case, we're just running it with the default value.

So all of that was allowing us to programmatically create the pipeline. If we go and look inside Studio under "Pipelines", we'll see the visualization of that pipeline that we just created. So here you'll see the pipeline. It's probably super small for everybody. So this is the pipeline that we just created. So we can run this programmatically with automated triggers, and you can run it directly from inside Studio. And then a cool thing to point out, for those that are not familiar with Pipelines: if you click on a particular step within your pipeline, the nice part is we will automatically log all of this metadata. So all of the input metadata, so how was this model created? We'll automatically log outputs from this particular step as well as logs for debugging and just seeing what's going on within the step, and then step-specific information as well.
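A hedged sketch of the conditional step and the pipeline assembly described above. Here preprocess_step, train_step, eval_step and its evaluation_report property file are stand-ins for the model-build steps the demo skips, register_step is from the registration sketch earlier, and the JSON path and threshold are placeholders:

```python
# Conditional step + pipeline assembly sketch.
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.pipeline import Pipeline

# preprocess_step, train_step, eval_step, evaluation_report: stand-ins for the
# skipped model-build steps in the demo notebooks.
accuracy = JsonGet(
    step_name=eval_step.name,
    property_file=evaluation_report,
    json_path="binary_classification_metrics.accuracy.value",   # placeholder path
)

condition_step = ConditionStep(
    name="CheckAccuracy",
    conditions=[ConditionGreaterThanOrEqualTo(left=accuracy, right=0.7)],
    if_steps=[baseline_step, register_step],   # baseline and register only if the model is good enough
    else_steps=[],
)

pipeline = Pipeline(
    name="churn-train-baseline-pipeline",
    parameters=[input_data_param],
    steps=[preprocess_step, train_step, eval_step, condition_step],  # order is inferred from dependencies
    sagemaker_session=pipeline_session,
)

pipeline.upsert(role_arn=role)   # create or update the definition
execution = pipeline.start()     # or pipeline.start(parameters={"InputDataUrl": "s3://..."})
```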
Now, if we go back, you're probably interested in those "constraints" and "statistics" files. So our pipeline completed, and the baseline actually completed as well. So let's take a look at some of that data that is created as part of the baseline. Again, that output is stored in S3, so I'm just gonna load up that "constraints" file from S3 into a data frame. And you'll see here, this is the "constraints" file. So, for each feature on input, it's automatically detecting and inferring certain aspects of that data, such as completeness. This data is basically complete, so you see a lot of 100% there, as well as whether the values are positive or negative on input. You'll also get a "statistics" file. So this, again, for each feature that's on input, is doing some statistical analysis to come up with things like the min, max, mean, whatever is applicable to the feature on input. So, that being said, these two files are what's going to be compared against to detect signals of data drift during that daily pipeline run that's performing your batch scoring.

So let's go ahead and move on into that. So in this second notebook here, this is where we're focusing on that batch inference and model monitoring pipeline, which is gonna run every day. So this pipeline will start with our batch prediction input data. We'll also note the model package group that we want to use to pull the latest registered and approved model from. And then we will go ahead through the pipeline, get the latest approved model, we'll run the monitoring job, then we'll run the batch transform job. And the output of this particular pipeline is gonna be our prediction output data, but also our monitoring output. And that's gonna be in the form of monitoring reports stored in S3, as well as events emitted through CloudWatch logs. So let's run through this one. Again, some initial setup. Importing the libraries that we need to create the pipeline. Here we're also specifying two key outputs. Those are your batch transform results in S3, as well as your monitoring reports that'll go in S3.

The first step is we're just gonna configure a Lambda step. And what's nice about Pipelines, it does include a native step for Lambda, and this is great for applying or incorporating any custom logic or custom tasks within your pipeline. So we're just gonna have a simple Python function that simply goes and gets the latest approved model from the model registry that's gonna be used in our later steps to actually run the inference as well as model monitoring. So what that Lambda step is gonna do, it's gonna get the latest approved model, and then collect metadata from the model registry that's used for those later steps in the pipeline. So this is just our Lambda function, simple Python code. And then we're gonna configure that Lambda step. So to configure the Lambda step, there's just a built-in helper function, where we're pointing to the code for that Lambda function. And then we're specifying the outputs. In this case, this is the metadata that we need for the later steps within our pipeline. And then we're configuring our Lambda step. So this is a built-in step where we just specify the function as well as that model package group name from the model registry where we're pulling that latest-approved model version.
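A minimal sketch of that Lambda step, assuming the SageMaker Python SDK's Lambda helper and LambdaStep; the function name, IAM role, script and output names are placeholders:

```python
# Lambda step sketch: fetch the latest approved model version from the registry.
from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import LambdaOutput, LambdaOutputTypeEnum, LambdaStep

# Helper that creates (or updates) the Lambda function from a local script.
# The handler would call sagemaker:ListModelPackages filtered on
# ModelApprovalStatus="Approved" and return the newest version's metadata.
get_model_lambda = Lambda(
    function_name="sagemaker-get-latest-approved-model",                     # placeholder name
    execution_role_arn="arn:aws:iam::111122223333:role/LambdaSageMakerRole", # placeholder role
    script="lambda_get_latest_approved_model.py",                            # placeholder script
    handler="lambda_get_latest_approved_model.lambda_handler",
)

lambda_step = LambdaStep(
    name="GetLatestApprovedModel",
    lambda_func=get_model_lambda,
    inputs={"model_package_group_name": model_package_group_name},
    outputs=[
        LambdaOutput(output_name="ModelName", output_type=LambdaOutputTypeEnum.String),
        LambdaOutput(output_name="ModelPackageArn", output_type=LambdaOutputTypeEnum.String),
    ],
)
# Later steps can then reference these outputs through the step's properties.
```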
So that being said, we'll move on to our next steps, which are actually to configure the batch transform job that performs batch inference as well as the model monitoring job. And it's one single step within Pipelines. So the monitor batch transform step is a single step that actually spawns two processes, which can execute in the order that you define. And those are a processing job, once again with that managed container image, that is going to run monitoring against the statistics and constraints baseline that you created in your model training pipeline, and then, of course, the batch transform job as well. And to do that, similar to every step, you configure the jobs, and then configure the step. So here we're just indicating where our prediction data is. This is data that we've specifically loaded with some violations, and, inside here, we're passing in another parameter. So because, again, this pipeline's input data is going to change every day, you can pass this in as a parameter.

So first up is to configure your batch transform job. If you're familiar with batch transform jobs in SageMaker, this should look familiar. The second one is to actually configure the check job. So this is actually the monitoring job itself, where you're essentially specifying that input data. So this is the batch input data. And then you're also specifying the location that you want that monitoring report to go to. And then you just configure the step. So it's a single step that you configure that's gonna spawn those two processes, essentially, using the configuration that you previously defined. A couple of things to point out. It is the same container that's used between your baselining as well as the monitoring. You just tell it to act in different ways. So here, what we're saying is, monitor before batch transform. This is a monitoring job, it's not a baselining job. So we're saying here, we want that monitoring to run before you run the batch transform job. You can also fail on violation. I have it set to "false" mainly just to show the pipeline, but it's something you'd probably want to consider, so that if you are seeing signals of data drift, you fail your pipeline instead of proceeding into that next step to run your batch transform.

Then, once again, we're putting it all together, the same exact steps again. This is a one-time creation. Of course, you can update that pipeline. And then we start the pipeline execution. And we'll just take a quick look at this pipeline. It's a really simple pipeline. It has the Lambda step, the monitoring, and then the batch transform job. So three simple steps that can run on a repeated daily basis.

And if we go back and look, like I said, we did load this one up with some violations. So what happened during that pipeline run when we had that processing job? There are two outputs: the monitoring violations report, as well as the CloudWatch logs that get emitted from that. So let's take a look at both of those. This is the constraint violations report. I'm just gonna load it up into a data frame from S3. And you can see we found four violations. So we have four features on input that are looking different than our baseline data. Previously, within our baseline, it was 100% complete. Now you're seeing 92% to 98% complete. So, in this case, it did finish with violations. And you can modify these constraints depending on your business data and what's acceptable to your use case. The other thing to point out, and this is typically interesting to MLEs or DevOps teams, is that it does emit data to CloudWatch logs. And, inside there, if you query the logs, you'll see "completed with violations" for the monitoring job. And this is just showing that. This will only occur in the case where there are violations. So why this is important is you can essentially use this log data to create a metric, create an alarm that you can then notify the team on. You can stop the pipeline, whatever makes sense for your use case.
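A sketch of that combined monitoring and batch transform step; the MonitorBatchTransformStep class and its parameters are reproduced here as introduced around re:Invent 2022, so the exact names should be verified against the current SDK documentation, and the model name, S3 paths and baseline file locations (reused from the earlier sketches) are placeholders:

```python
# Single step that runs the data quality check and the batch transform job.
from sagemaker.model_monitor.dataset_format import DatasetFormat
from sagemaker.transformer import Transformer
from sagemaker.workflow.monitor_batch_transform_step import MonitorBatchTransformStep
from sagemaker.workflow.quality_check_step import DataQualityCheckConfig

batch_data_s3 = f"s3://{default_bucket}/{prefix}/batch/input"        # placeholder daily input
monitor_reports_s3 = f"s3://{default_bucket}/{prefix}/monitoring/reports"

transformer = Transformer(
    model_name="churn-model",            # in practice, built from the Lambda step's output
    instance_count=1,
    instance_type="ml.m5.xlarge",
    accept="text/csv",
    output_path=f"s3://{default_bucket}/{prefix}/batch/output",
    sagemaker_session=pipeline_session,
)

monitoring_config = DataQualityCheckConfig(
    baseline_dataset=batch_data_s3,                    # the day's batch input is analyzed...
    dataset_format=DatasetFormat.csv(header=False),
    output_s3_uri=monitor_reports_s3,                  # ...and the violations report lands here
)

transform_and_monitor_step = MonitorBatchTransformStep(
    name="MonitorThenTransform",
    transform_step_args=transformer.transform(data=batch_data_s3, content_type="text/csv"),
    monitor_configuration=monitoring_config,
    check_job_configuration=check_job_config,
    monitor_before_transform=True,    # run the data quality check before scoring
    fail_on_violation=False,          # set True to stop the pipeline on drift
    supplied_baseline_statistics=f"{baseline_results_s3}/statistics.json",
    supplied_baseline_constraints=f"{baseline_results_s3}/constraints.json",
)
```

And one hedged way to turn that "completed with violations" log line into an alert, using plain boto3; the log group, filter pattern and SNS topic are assumptions to adapt to your own setup:

```python
# Alerting sketch: metric filter + alarm on the monitoring job's violation message.
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Monitoring jobs run as processing jobs, so their logs land in the SageMaker
# processing log group; confirm the exact message text from a real run.
logs.put_metric_filter(
    logGroupName="/aws/sagemaker/ProcessingJobs",
    filterName="batch-monitoring-violations",
    filterPattern="CompletedWithViolations",
    metricTransformations=[{
        "metricName": "BatchMonitoringViolations",
        "metricNamespace": "Custom/ModelMonitor",
        "metricValue": "1",
    }],
)

cloudwatch.put_metric_alarm(
    AlarmName="batch-monitoring-violations",
    Namespace="Custom/ModelMonitor",
    MetricName="BatchMonitoringViolations",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:model-monitor-alerts"],  # placeholder SNS topic
)
```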
Which brings us into the next part here, in terms of handling monitoring violations. A couple of things: we always recommend automating the exception handling. So automate those logs and the alerts when there are violations detected. What you do with those may depend. It may require manual intervention from a data scientist to look at it and see if it's within acceptable ranges before you go ahead and do the batch transform. But you could also automate it with your retraining pipelines too. If there are violations detected, and you have ground truth data to retrain on, you can go ahead and trigger your retraining pipeline.

So, that being said, we went through a lot. We focused primarily on automation and model monitoring here. We could continue to evolve and mature this to incorporate things like continuous integration, source control, version control, and continue to mature these pipelines. Which brings us into our next topic, which is really about scaling MLOps. And we'll introduce a framework here. So we're gonna quickly look at a typical journey that we see customers take when they're incorporating MLOps into their technology practices. We look at the journey in four stages of adoption, from initial all the way to scalable. And as you move across those different stages, you see increased operational efficiencies as your MLOps maturity grows.

Let's take a look at some of the deeper capabilities within each stage, starting with the initial stage, where we're really focusing on, how do we implement some standardization around our tooling, how do we make sure data scientists have access to the resources that they need, like compute, like data, by standardizing our experimentation environments as well as starting to standardize a mechanism for experiment tracking. Then we move into the repeatable level, where we're really focused on automation a bit more. So how do we remove some of those ad hoc aspects that we see in the initial phase, when we're just moving out of proof of concept and into more of the repeatable stage, where we're not only interested in reducing our time to POC, but we're also interested in reducing our model deployment time as well? So this is where we see a lot of automation come in, in terms of automating your model build and model deploy pipelines, and automating access to resources and data science environments for your data scientists that automatically include all the best practices around governance and security, the things data scientists do not want to worry about. So all of that is essentially automated. And we also see here starting to standardize on source code repositories, good practices with source code control, and then centralized management of models done through SageMaker Model Registry.

Then we move a little bit into the reliable stage, and this is where we see customers starting to incorporate full CI/CD practices, like source and version control consistently used within pipelines that are automatically triggered, also implementing quality gates and model monitoring. And then into the scalable phase, where we really start to see the standardization of templates that are not only used within a single team, but across the organization. So here we're really focusing on reducing that entire machine learning development life cycle time. Now, this is just high level. You may be looking at that, thinking, "I have certain practices that span different parts of the journey," which is completely typical. This is just a typical pattern that we see, and it's a high-level view of that journey that we often see with customers. But the point is, it really is a journey. And as you look to increase your operational efficiencies through MLOps, just keep that in mind.
It takes some time to get all of that stuff incorporated. And, speaking of that, I'm gonna now turn it over to Greig, who is gonna tell us about his MLOps journey at NatWest. - Thank you, Shelbee. - You are welcome.

- Good. Hi, everyone. So how many of you have ever worked in an organization where it takes too long to get something done? Yeah, everyone. How many of you have struggled to get your data science and ML workloads adopted across the enterprise? Hey. Okay, yes. Well, you're not alone. We were there. And what I'd like to do, over the next 20 minutes, is really give you a flavor of where we've come in our data science journey at NatWest Group, what we've built with AWS, and how that's let us deliver an MLOps solution that's secure, scalable and sustainable across the enterprise. And, ultimately, that's really letting us get more value from our data more quickly.

So for those of you who don't know, NatWest Group is one of the UK's leading business and commercial banks. We have customers, I think one in four businesses across the UK are with us, from startups to multinationals. We have a large retail organization, and, together, we have about 19 million customers that we support to thrive. We have a growing community of about 500 data scientists and engineers across the organization, and they're all passionate about trying to use the large data asset that we're collecting on our customers to really make a genuine impact to their lives. Many of the brands in the group have histories dating back 200 or even 300 years, and that's a legacy that we'll come back to touch on in a minute. But over that time, the bank has created thousands of different models and rules-based systems that cover all aspects of what we do, from capital allocation to fraud detection to delivering prompts in the mobile app. And, for me as a data scientist, that's a fantastic opportunity to work within, and to change. And we're really trying to change how we deliver value from that data using ML operations.

I think, as you have in many large organizations, a lot of the challenges that we faced before ML operations can be placed into four categories: people, process, data and technology. So we have a lot of talented people in the organization, but they've not, historically, had the right training and support to work with cloud, and to work in this software development mindset. Secondly, on process, we've grown organically over time, and that leads to many legacy processes that get in the way of innovation and time to value. I've seen so many cases of data scientists working in one system, writing code and models, who then hand over to engineers in another system, who then implement that in production. And that just leads to frustration on all sides. Data is often siloed, difficult to discover and access, and our technology estate was often fragmented, out of date and, frankly, not very attractive to work with for the next generation of staff we're looking to recruit.

So with these challenges fresh in our mind, we then focused on building out our MLOps vision, and this is focused on, I guess, four main themes that are tied together by that faster time to value. So the first one's, really, how do we standardize on patterns for the creation of infrastructure, and the management of models and the pipelines that support them? Second, how do we use MLOps to challenge the existing governance processes, and come up with simplified procedures that let us go faster?
Third, how do we break down those silos, and simplify that data access across different teams in the enterprise? Then, finally, how do we create a modern tech stack that's supported by a federated operating model, which gives power and autonomy to the data science teams themselves, to self-serve that infrastructure and have ownership of the end-to-end solution? And, as I said, really, the key point here is how do we go faster? How do we deliver value from our data more quickly, to get those insights on our customers more quickly?

So what we've done with AWS over the past couple of years is really realize this vision of ML operations. Right from the start of that engagement, we were laser-focused on a set of metrics that we wanted to use to really articulate the story of where we were, and where we wanted to go to. These metrics were really focused on that time to value and how we go faster to deliver end-to-end solutions, to simplify data access, to get things live and to allow the self-service creation of environments quickly. Typically, as a data scientist, when I joined the bank, four years ago, we were very much in that left-hand corner of the maturity scale that you see here. When I joined, it could take you weeks to get even the most basic Python environment up and running. And that was only after talking to 10 different teams, or looking at 10 different wiki pages. And that was before even looking at any data. So it was really a big struggle to get access to data, to build some models and then to try and get them into production to get that insight. What we've now done is really work with AWS to build our ML operations environment, centered around SageMaker. We're now training deep learning models on multi-GPU architectures in the cloud, and getting those solutions live into production in a matter of months. And that was something that just wasn't possible within the group even 12 months ago.

What I wanna do now is just talk you through, roughly, what our SageMaker architecture looks like, and it really all starts with what we call our shared service account. So here we've got a centralized account, owned by a central platform team, that hosts common resources and artifacts that we use across all the different use cases in the enterprise. So, for example, we have common Docker images that underpin the SageMaker Pipeline steps that Shelbee spoke about. These are stored in ECR. We've got shared and pre-approved infrastructure products that sit within Service Catalog, and other artifacts that are related to model pipelines and their promotion through into production. How it then works is that a team will create a request, via ServiceNow, to that central platform team to provision three accounts that they then use for their infrastructure: development, testing and production. And those all have secure connections to our enterprise data lake that ties everything together. Once those accounts have been provisioned, the team itself can then self-serve, via Service Catalog, the provisioning of user roles, a SageMaker Studio domain, and other products into those accounts, so that the teams themselves can start working on the use case. At that point, there's no connection back to the central platform team, and the team itself is really self-sufficient.
The names of the accounts, development, testing and production, relate back to how they're used in the data science life cycle. So, in development, you'll obviously have data scientists and engineers using notebooks, starting to do data analysis and data wrangling, starting to build their first models, and starting to build their first pipelines for training and for inference. The testing account then mirrors what you have in production, and is used for, obviously, testing your different inference workloads before you then promote them through into production. Once the data science team starts to work, and you build some models that you're happy with, we then have another role, called model approver, that comes into play. They can then look, via the Studio interface, to understand different metrics related to your model. Is it performing to your business requirements? They can check things like data bias and explainability, and, once happy, they can then promote those models and pipelines through into the test and production accounts. And that's via a few clicks in the console. And that then gives us the ability to have these inference pipelines in a production environment.

And what's really crucial for us is that, until we had this system, every use case, every team, had to create their own route to live in a very bespoke way, going through architecture boards and design forums. And it just led to that duplication of effort, but also a lack of standardization. And now, with this system, for each use case, we have that route to live built in from day one. That allows the teams themselves to simply focus on solving the business challenge, and not have to worry about all these different components, and how to manage that infrastructure. For me, one of the key things, one of the strong points of SageMaker, is that it now also gives us that standardization for data teams to really focus on the business challenge. And it gives them a common language that they can use to talk to each other about how to create those production pipelines. So now, when I dive into conversations with different teams, I hear them talking about training jobs, processing jobs, model registry, experiment tracking, and we now have that common language that really helps teams across the bank support each other and share that knowledge, and, ultimately, lets us go faster, as an organization, to build these workloads.

One of the challenges that we quickly faced once we'd launched this platform was the fact that we had a small number of centrally-managed Docker images that support those SageMaker Pipeline steps. And we found that that quickly became unfeasible, because all the different teams we were working with had a whole bunch of different requirements for their data science workloads, so we quickly engineered a system to let teams bring their own Docker images. So, from within that development account, a data scientist or engineer can use a CodeBuild project to build their own Docker image, which they can then push to ECR in that shared service account, and that makes it available for anyone on the platform to use in their workload, whether it's in their model training step or in some inference step or a processing job. They can then use that image, and it has the inbuilt dependencies for that particular use case. And that just gives teams the flexibility to then control how they want to develop.
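As a small, hypothetical sketch of how a team-built image pushed to a shared ECR repository might then be referenced from a pipeline step; the account, region, repository, role and script names are invented for illustration, not NatWest's actual setup:

```python
# Bring-your-own-image sketch: a custom ECR image used by a pipeline processing step.
from sagemaker.processing import ScriptProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

pipeline_session = PipelineSession()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"   # placeholder role

# Image built by the team's CodeBuild project and pushed to the shared account's ECR.
custom_image_uri = "111122223333.dkr.ecr.eu-west-2.amazonaws.com/team-churn-processing:latest"

processor = ScriptProcessor(
    image_uri=custom_image_uri,
    command=["python3"],
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
    sagemaker_session=pipeline_session,
)

feature_step = ProcessingStep(
    name="TeamFeatureEngineering",
    step_args=processor.run(code="feature_engineering.py"),   # placeholder script
)
```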
As you heard from Shelbee, a key component of this is SageMaker Pipelines, and then, within Studio, the concept of SageMaker Studio projects. Again, when I initially approached this, I was concerned that this structure would be overly constraining to teams, and would limit how they could create and solve problems for our own customers. However, what I found is, actually, that the structure it brings is liberating, and really helps teams focus only on solving the real business problem, and lets them rely upon the structure that comes with Pipelines to ensure they have the guardrails they need to be compliant with what we do as a bank. One of the things that we noticed is that within SageMaker Studio, out of the box, you have a lot of pre-built pipelines and projects that you can rely upon, but we found that, working as a heavily regulated financial services firm in the UK, we have a lot of additional compliance steps that we have to satisfy, to do with data checking, bias monitoring and explainability. And one of the advantages of using SageMaker Pipelines is that the teams themselves could quickly adapt and modify these templates to their own needs, such that we had all these inbuilt steps. This was done once, and then made available to everyone on the platform to rely upon for their own use cases. And we're now continuing to expand this list of custom pipelines to include additional frameworks and processing steps that we need to meet our needs.

Once the platform was launched this year, we quickly moved into a mode where we tried to focus more on the adoption and engagement of the platform across the enterprise. We didn't just wanna focus on a few high-performing teams to use this, but wanted to get adoption across all aspects of how we operate. So we focused on a few things. First, we trained hundreds of data scientists and engineers in how to use AWS services. Second, we established a well-resourced enablement squad that really embedded themselves with all the different use case teams to get their projects onto the platform, and educate users how to work with it, but also to gather continuous feedback about the platform we created, and use that to make future improvements. We were really clear about having visible metrics within that process to ensure that we could continually track and report on how that adoption and engagement was going, so we could have that conversation with our stakeholders to really ensure we could celebrate our successes, and also understand where the problems were gonna lie. And then we really reached far and wide across the organization, not just focusing on traditional data and analytics teams, but looking at teams in fraud, finance, audit, climate, to get them on board, and to identify user champions within those groups to really advocate for the use of the platform.

So, just to bring this to life a bit more, one of the things that we had noticed before we launched this platform was that, just to get started, a team would typically take 40 or 50 days just to have an environment in which you could look at some data. So if you had an idea in NatWest last year, you had something on a whiteboard, and you went to build a solution, just to have some very basic SageMaker environment up and running could easily take 40 to 50 days. And that just leads to that delay to value, and that delay to innovation.
Since we've launched this platform, we've drastically reduced that time down to one or two days. So that automatically helps get things up and running. Then, within that, once you have those baseline accounts created, teams have the ability to self-serve the creation of their SageMaker Studio domain and other products within that, and that can be done in a matter of one or two hours, whereas before, it could easily take the central platform team up to a week to create that bespoke environment for each individual. So, again, this really just helps us get more value from our teams and our data more quickly. So since we launched, about six months ago, we now have 13 teams across the bank on this platform. Over 30 use cases are now building and developing ML solutions on it. We've got hundreds of projects and pipelines up and running, tens of thousands of processing jobs, and we've now trained over a thousand models in our development environments. So this is just gonna continually help us adjust, and move the dial in terms of how we get ML adopted across NatWest.

One of the use cases that we've recently put live on this platform is what we call customer conversational intelligence. So, for many years, the bank has had a chatbot and telephony centers that our customers use to interact with us for access to services, and to get that support. We found that a lot of the data that links to those conversations was sitting in a database, not being used. There was no insight being generated from it to help with things like optimizing and personalizing customer journeys. There was nothing there to help with customer self-service support, and there was no insight to help our operations teams optimize and improve their own efficiency. So with this platform we've built on SageMaker, we then created two new ML models that look at those conversations, both the web chat from our chatbot, and also the voice transcripts that come from our telephony centers. So we have around 100,000 conversations a day across the bank, and we built two models, one using BERTopic and another one using a fastText classifier, to do two different things. One was to create a reason for the conversation, to help us understand why the customer was contacting us, and the second was to look at the resolution status of the conversation. So was the conversation resolved, was it dropped, was it deferred to another agent? And these insights we can then provide back to the business units to help them with that optimization of the customer journeys.

So just a few key takeaways from today. I think the first thing is, really, don't forget about the hearts and minds in this process. When we started this at NatWest, we realized it was a major transformational change to the organization that we were going through. So the vision that we created for ML operations had to be compelling and business-led, and the adoption and engagement had to be excellent for the community. We had to make sure that we could build for complexity. Data scientists and engineers are a smart bunch, so we had to make sure that we considered that bring-your-own mindset, both for Docker images and for pipeline templates. I think, also, MLOps, it's a journey. So you have to be flexible in what you do here.
You have to get that continual feedback from your customers, deliver value quickly, respond quickly to their feedback, and make sure that you then build what you need for your own enterprise. The operating model that you have has to be thought through quite carefully. I think that was a mistake that we made initially, we didn't think too much about the operating model, but that is something we've now retrofitted into the platform. And I think, for us, that federation really helps put the power back into those teams that sit across the business, to give them the opportunity to work in that federated and self-serve way. And then, finally, as with any large organization that's grown over time, typically, you have a lot of legacy tech that sits there. So whatever you build, you should really make sure it can integrate with those other data sources, or other systems that you use in your enterprise, either on cloud or on premises.

So, in terms of where we are now, we've definitely moved significantly up this curve that you saw from Shelbee. Project delivery could easily have taken 12 months plus before. We're now really rapidly getting to the point where we're delivering projects with ML at the core in less than three months. The data access piece is significantly shorter, we can do that within a day, and really that self-service creation of those environments, which could, before, have taken many weeks, can now be done in a matter of a few hours. So this really helps us move much more up that curve towards that scalable phase, and really helps us get much more value from our data much more quickly, and, ultimately, helps us support those 19 million customers that I mentioned at the start. Thank you. (crowd applauds) Sorry, I realized Usman's back. - [Usman] That's fine. I'm just (speaks indistinctly) Thanks, Greig. That was fantastic. All right, everybody, so that
Info
Channel: AWS Events
Views: 8,486
Keywords: AWS, Amazon Web Services, AWS Cloud, Amazon Cloud, AWS re:Invent
Id: mRNcVKJ6UNo
Length: 55min 55sec (3355 seconds)
Published: Fri Dec 02 2022