Machine Learning from Development to Production at Instacart

Captions
I'm Montana. I work on the machine learning platform at Instacart. Part of our mission is to build a firm foundation for all of our data scientists to create the most sophisticated models they can possibly conceive, so that those models deliver value back to our customers, our retailers, our shoppers, and our CPG partners. The other big part of our mission is to empower our regular software engineers and data engineers to leverage machine learning in their organizations without necessarily having all of the expertise that our data scientists have when it comes to sophisticated modeling. How this plays out in practice is mostly what this talk will be about. I'll spend ten minutes at the beginning explaining high-level machine learning concepts for the data engineers in the room who may not be familiar with them.

Before we go into any of that, I'll tell you a little bit about Instacart so you understand why we care about data, data science, machine learning, and data engineering. Our core value proposition is that we deliver groceries in as little as an hour from the stores that you love. We connect customers to a personal shopper who delivers the service directly to the customer in real time. We partner with local retailers to give customers access to stores they are familiar with and trust. We also offer coupons and other incentives from CPGs (consumer packaged goods manufacturers, for people not in the industry), which increase the value of our proposition back to the customer.

The customer experience begins with choosing a store. We partner with hundreds of retailers throughout the country to give customers the largest selection possible. They select the groceries they want from that retailer's warehouse, build their shopping cart, check out, choose a delivery time in as little as an hour, and then we deliver those products to their door. From the shopper side of the equation, the shopper gets a notification that there is a delivery available. They choose to accept it, go into the store, shop for all of the items, and scan each item so that we get strong verification signals. They can also replace items that are out of stock. Eventually they check out with an Instacart credit card at the retailer and deliver the goods to the customer.

One of the key ways we use machine learning at Instacart is on our search and discovery team. This is probably the most familiar part of the product to anybody who has used the service. We use supervised learning in several different places; it is probably the most prominent form of machine learning in the industry right now that is delivering real value. Pure text matching on search results is often not good enough when you have subtle differences like "milk", "milk chocolate", or "chocolate milk" that need vastly different search results. The way we improve this is to look at what features are in these products. We break every product down by its features: what is the brand, what is the fat content, is it organic, is it pasteurized, is it homogenized, how big is it, where was it made. We can go all the way down to the dominant color in the image to see what exactly represents this product. We encode all of this information numerically. Additionally, the fact that some products are missing information becomes important for ultimate search result quality. If something doesn't have an image, it's probably much less appealing to a customer, and so that kind of gap in the data is actually a really big information signal when it comes to machine learning.
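To make that concrete, here is a minimal sketch of the kind of numeric encoding being described: turning product attributes into model-ready features, including an explicit indicator for missing data. The column names, values, and use of pandas are illustrative assumptions, not Instacart's actual catalog schema or code.

```python
import pandas as pd

# Toy catalog rows; every column name here is hypothetical.
products = pd.DataFrame({
    "brand":       ["Acme", "Acme", None],
    "fat_content": [0.02, 0.035, None],          # 2%, 3.5%, unknown
    "organic":     [True, False, True],
    "image_url":   ["https://example.com/milk.jpg", None, "https://example.com/oj.jpg"],
})

features = pd.DataFrame({
    # Categorical attribute encoded as integer codes (one-hot is another common choice).
    "brand_code":  products["brand"].astype("category").cat.codes,    # missing becomes -1
    "organic":     products["organic"].astype(int),
    # Numeric attribute with an explicit missing-value flag, then filled for the model.
    "fat_missing": products["fat_content"].isna().astype(int),
    "fat_content": products["fat_content"].fillna(products["fat_content"].mean()),
    # The "gap in the data" signal: does the product even have an image?
    "has_image":   products["image_url"].notna().astype(int),
})
print(features)
```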
It takes a lot of domain expertise to understand all of these features and their relevance to customers, which is why we embed all of our data scientists and machine learning engineers directly into small teams, working closely with product managers on a day-to-day basis. We don't have a centralized machine learning, data science, or data engineering team. Every product in our catalog shares a set of common features.

When we talk about an individual feature, the key to successfully designing it is translating our human understanding into numeric values. We call this encoding. For example, colors are continuous from red, orange, yellow, green, blue, indigo to violet, so we can number those one through seven. Software engineers have standardized on more than 16 million colors between red and violet, so a computer can tell you with much greater precision what color this milk is than I can; I would just call it cream every time.

When a customer searches for milk and adds a product to their cart, they give us dozens of training examples in an instant: they tell us that the one product they added to their cart is milk, and that all the other products we showed them are not a match for milk. Not everyone agrees on what the best match for milk is on Instacart, but this is where machine learning really shines. The features of each product are what is important, not any one example. The model will be trained to understand which features are statistically "milky" and which are not. Homogenized, pasteurized, and grade A are all important signals, keeping in mind that the model will also be trained with orange juice, which can be pasteurized but is often not homogenized. Machine learning models consider dozens of features holistically before they make a single decision. When we see new products from new retailers at new stores, we can instantly serve great search results, because our models have learned to generalize from individual features, not from the products themselves. We can scale our catalog faster and more accurately than we could by trying to hand-label all of the products in the world.

Once you move to a numeric model of the world, you unlock all sorts of functionality, so I'll show you some of the ways we build on what our machines learn from search. One of the deep learning superpowers we have is called transfer learning. We can take something learned in a previous model about what makes a good search result for milk across all of these products and transfer it to help predictions on a new task. In this case we want to find which products are competitive in our catalog, so we can help our customers find them, just like they do in brick-and-mortar stores. Coke and Pepsi are very similar numerically in a search for cola, but it turns out very few people add both of them to the cart at the same time. That signal, that some products are similar but rarely purchased together, allows us to merchandise them as competitive with each other. We can apply the same model in the opposite direction: when people buy peanut butter they often buy jelly, so we see the merchandising opportunity there as a complementary product instead of a competitive one.
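As a rough illustration of that idea (and not Instacart's actual model), the two signals can be combined very simply: pairs that look alike to the model but are rarely bought together read as competitive, while frequently co-purchased pairs read as complementary. The embeddings, co-purchase rates, cutoffs, and product names below are all invented for the example.

```python
import numpy as np

# Hypothetical learned product vectors, e.g. transferred from a search model.
embeddings = {
    "coke":          np.array([0.90, 0.10, 0.00]),
    "pepsi":         np.array([0.88, 0.12, 0.02]),
    "peanut_butter": np.array([0.10, 0.80, 0.30]),
    "jelly":         np.array([0.15, 0.75, 0.35]),
}

# Hypothetical co-purchase rates: fraction of carts with product A that also contain B.
co_purchase = {("coke", "pepsi"): 0.01, ("peanut_butter", "jelly"): 0.35}


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def relationship(a, b, similarity_cutoff=0.9, co_purchase_cutoff=0.1):
    """Crude labeling rule: frequent co-purchase means complementary; numeric
    similarity without co-purchase means competitive (a substitute)."""
    rate = co_purchase.get((a, b), co_purchase.get((b, a), 0.0))
    if rate >= co_purchase_cutoff:
        return "complementary"
    if cosine(embeddings[a], embeddings[b]) >= similarity_cutoff:
        return "competitive"
    return "unrelated"


print(relationship("coke", "pepsi"))           # competitive
print(relationship("peanut_butter", "jelly"))  # complementary
```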
That's a very quick look at what machine learning is and how it can be used to improve a storefront. Now I want to talk about a project that's more top of mind for me right now, which is the open-source library that we use on a day-to-day basis. We're using it throughout the org now; four out of four of our data science teams are using it on all of their newest projects. We still have a large number of legacy models being deployed with various bespoke techniques that we've developed over the last five years, but moving forward we'd like to standardize how we develop and design models. The really interesting thing is that, taking our experience building models, from professionals with dozens of combined years of industry experience across different organizations, we can boil it down into a set of best practices: what a typical model looks like, what its lifecycle is, how we want to deploy it, how we want to maintain it, and how we need to build it. That works more or less generically and universally in this specialized supervised learning case, so there is that caveat.

This is a fairly complicated slide. We're going to break it down piece by piece and walk through it, and I'm going to put code up on the slides, because that's the language I really like, so that we can go through a step-by-step process of building and deploying a model into production at Instacart.

Lore is the name of the project. It's available on PyPI as a pip package; you can pip install it, and that's always the first step. It's similar in its command-line conventions to Rails. Instacart is a Rails shop, so if you've ever done web development, you know that you can call rails new and it will generate a big framework dump of a directory inside your GitHub repo. A lot of that thinking carries over: convention over configuration, simplicity, ease of use, being opinionated about certain things. There are a hundred different options, most of them are good, and most of them have trade-offs. We saw benchmarks earlier between Snowflake, Redshift, and BigQuery. Does it really matter? Maybe, but ease of use is what we want to solve for most of the time; we want to solve for developer productivity.

So, the way you create a new machine learning project at Instacart: a project may be used for a single model, but often it's used for a whole family of models, like we just saw in the search example. Once you have one model, you very quickly start to find other applications that want very similar combinations of features with slightly different objective functions, and keeping all of that code together in the same project really enables reuse and more modular, testable systems. The first command, like I said, installs the lore package, which gives you a command-line application. You can lore init a new project, and that gives you a directory with a base skeleton that's deployable on Heroku; you can use a Heroku buildpack to deploy it into production, or wrap it up in a Dockerfile if you want. Basically, all that means is that you get an empty Flask app running nothing, because we haven't actually created a model yet. The third command, at the dollar sign, generates a scaffold.

What we're going to predict is this: occasionally, three or four months after a delivery is created, somebody will call up and say, "hey, I never did that," and they'll issue a dispute on their credit card. This is very costly for Instacart when it happens, even though it's an incredibly rare event, which makes it a very difficult machine learning problem.
It's also a very important problem to solve: is this delivery legitimate? Given the features we have, the signals available at the time the customer checks out, before we've paid for $200 in groceries and spent an hour driving them over to their house, is this the person's credit card? Do we actually have a legitimate transaction taking place? You'll see machine learning deployed pretty broadly across all types of fraud detection in the industry.

What we're doing here is adding a regression parameter to the scaffold, so we're going to generate a regression model. If you want to think about supervised learning models, there are broadly two types. There's classification, where you're trying to predict whether something is A, B, or C, and there's regression, where you're predicting a continuous numeric value: zero to infinity, or negative infinity to positive infinity, and a lot of people will bound it from zero to one. So regression predicts a continuous range of numbers, while binary or multi-class classification has specific, discrete buckets you're trying to place things in. We could frame this problem as binary classification, true or false, or we could frame it as a probabilistic regression problem: from zero to one, how likely is it to be fraud, where zero means there's no chance and one means there's a one hundred percent chance. The data scientists will have opinions about which of these is the appropriate technique, and your product manager will have opinions about whether they want a true-or-false answer or a sliding scale. So this is actually a very important design decision about what you're going to do with your machine learning model, which is why it's escalated all the way up to the command line. There are a lot more command-line parameters you can add if you want, and you don't have to generate the full scaffold; you can generate any single one of these pieces on its own.

We'll talk about what all of the generated pieces are. For now I'm going to gloss over the six files that come out, other than to point out that there are some tests, there are some IPython notebooks, which anybody who has done data science should be familiar with, and there are three Python classes. We'll start with the data source and create an extract. Everybody in data engineering is familiar with ETL processes; this is strictly an extract, and it typically looks like a SQL file. In our case we're only going to have one feature in this very first version of the model, to keep things short so we don't run into happy hour. We're going to look at the visitor's IP address, the latitude and longitude they had the delivery delivered to, and then the truth for training: was this credit card charge actually disputed within some time frame? We can geolocate the IP address, measure the distance between that IP address and the delivery address, and ask whether this person is actually at the place they're having the order delivered to, or whether they're in, say, Russia, testing fake credit card numbers to see if they can find some new way to scam us. That's a pretty quick query, but you'll notice it's going to return every delivery from all time. We'll run it against our data warehouse, Redshift, something that can actually handle that kind of load, and then we're basically pulling that entire table into memory for those four columns.
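As a sketch, the extract being described might look something like the query below. In a Lore project it would normally live in its own .sql file next to the pipeline; it's shown here as a Python string, and the table and column names are invented for illustration rather than taken from Instacart's schema.

```python
# Hypothetical extract for the dispute model; table and column names are made up.
EXTRACT_SQL = """
SELECT
    deliveries.visitor_ip_address,                      -- where the order was placed from
    deliveries.delivery_latitude,                       -- where the order was delivered to
    deliveries.delivery_longitude,
    (credit_card_disputes.id IS NOT NULL) AS disputed   -- the truth label for training
FROM deliveries
LEFT JOIN credit_card_disputes
    ON credit_card_disputes.delivery_id = deliveries.id
"""
```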
The next step in the pipeline is to run that data into a cache and then split it apart. When you do machine learning, it's very important that, in addition to your training data, you split out two other data sets: your validation data, which you'll use pretty soon, and your test data. The reason you split those out is that you don't want your algorithm to peek at the answers you're going to test it against. You show it a bunch of data and let it train on that, while you hold some data out for later; after it thinks it knows the right answers, you use the held-out data to test it. We actually have two different test phases in many of our machine learning models. One is an iterative test that you go through continuously, constantly trying to optimize your model against that data set. If you do that, though, you run the risk of optimizing for just that particular data set. So after you're all done, once you've put your stamp of approval on it and said you're happy with the results you're getting, then you get to use your final test set, and that's the final score you use to say whether you've improved on previous efforts.

Going forward, we'll construct our pipeline. This pipeline is probably not like most of your data engineering pipelines; it's an overloaded term that we use in machine learning. This pipeline inherits from the base Lore pipeline class. It's a holdout pipeline, so it's going to hold out those two extra slices of the data set, and it has basically one job, which is to return a data frame from its get_data method. Lore is object-oriented, so you typically inherit from a base class and then override one or two key methods, or the whole thing if you really want to get crazy. You can see here that lore.io.redshift is our pooled Redshift connection, with the appropriate credentials already configured for the project in the configuration file. There are many connections available in lore.io: we have S3 access, Postgres access, Redis access, all of these things that our wonderful data engineers provide, pre-configured, for our software engineers and machine learning engineers to use with a single function call. The dataframe call returns a pandas DataFrame; if you're familiar with Python, that's basically an in-memory table on which you can perform many of the SQL-like operations we're used to, in a highly efficient manner. An interesting caveat is that it's columnar, similar to our data warehouses, not row-oriented, so you should keep that in mind when designing your operations. The final thing to note is that we're using the SQL file we created previously, and we're going to cache the result. You don't have to cache if you don't want to, but I like caching.
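Put together, the pipeline described so far might look roughly like the sketch below. The class and method names follow the talk, but the exact module paths can differ between Lore versions, and the project and file names (fraud_detection, disputes) are invented for illustration.

```python
# fraud_detection/pipelines/disputes.py  (hypothetical layout; a skeleton like this is what
# `lore init` followed by `lore generate scaffold` produces, which you then edit)
import lore.io
import lore.pipelines.holdout


class Holdout(lore.pipelines.holdout.Base):
    """Splits the extract into training, validation, and test sets."""

    def get_data(self):
        # lore.io.redshift is the project's pre-configured, pooled warehouse connection.
        # `filename` points at the .sql extract; `cache=True` memoizes the result on disk.
        return lore.io.redshift.dataframe(filename='disputes', cache=True)
```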
The next step is to take the raw data we have, an IP address, a latitude and a longitude, and the final answer, and calculate what we actually want, which is the distance between that IP address and the delivery latitude and longitude. This is a bit of a contrived example, but these are encoders inside of Lore. Back inside our pipeline, we define a second function that returns the encoders we want to use to construct the features for our machine learning model. In this case we've said we want to calculate the distance between those two points: we use the GeoIP transformer on the IP address to get a latitude, another GeoIP transformer on the IP address to get a longitude, and all of those become inputs to the Distance transformer. Norm is the actual encoder. The difference between encoders and transformers is that encoders are stateful and transformers are not; transformers are pure functions, so they don't need any memory and they don't need to have seen data like this before. Machine learning models, on the other hand, particularly deep learning models, don't like very large numbers. If you pass 10,000 in one feature and your other feature only ranges between 0 and 1, the 0-to-10,000 feature can completely dominate and swamp the other feature. What Norm does is look at all of the values in that feature, subtract the average, and divide by the standard deviation, so that for most of your inputs you get a small number around 0 to 1, which is nice for most machine learning. Of course that's stateful and dependent on this particular model, so it's something we have to keep track of for all time now: during training we learned the average and standard deviation of this feature, so we have to save them with the model and keep them going forward, so that any new data we see later gets the same transformations and encodings applied to it.

For the final bit, we get the output. The output is the disputed column, whether this delivery will eventually become a credit card dispute, and we use the Pass encoder on it, which simply says "take the value I'm giving you." It's basically a no-op, because we like the fact that this is already a boolean from the database, already 0 or 1. Once we've got that, that's basically the end of our pipeline; Lore takes care of pretty much all of the rest.

To finish our model, we need to inherit from the base class that we choose. In this case we're going to use the Keras library, which is built on TensorFlow; it's a deep learning library. You need to implement the constructor, which basically passes in a pipeline and an estimator. You don't strictly need to subclass, because all we're doing is creating an instance of the superclass with these two parameters; you could just instantiate the superclass and assign the pipeline and the estimator you want to use. The estimator we're using is a default Lore estimator, and we're actually going to swap in the binary classifier. That's because halfway through writing my slides I decided I'd rather use binary classification for this problem, since my product manager made a last-second change, rather than the regression I started with. It's very important, and very cool, that I can start with one type of model and make a one-line change here after I've generated my scaffold; I'm not boxed into a corner. Or, if I'm actually doing the testing I should be doing as a good data scientist, I can compare: what does it look like if I do binary classification, how many trues and falses do I get, versus a regression? How are we going to choose the threshold that decides what action we take? Maybe if fraud is very likely we completely block the order, if it's somewhat likely we send it to review, and if it's moderately likely we just throw up an error on the website. Those are product manager questions.
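Here is a rough sketch of those two pieces: the encoder definitions in the pipeline and the model class. Names such as GeoIP, Distance, Norm, Pass, and BinaryClassifier come from the talk, but the constructor arguments, argument order, and module paths are my guesses at Lore's API, so treat this as pseudocode rather than verified usage; the project name is still the invented fraud_detection.

```python
# Continuing the hypothetical fraud_detection/pipelines/disputes.py
import lore.pipelines.holdout
import lore.models.keras
import lore.estimators.keras
from lore.encoders import Norm, Pass
from lore.transformers import GeoIP, Distance


class Holdout(lore.pipelines.holdout.Base):
    # get_data() as shown earlier would also live here.

    def get_encoders(self):
        # A stateful encoder wrapping a chain of stateless transformers: geolocate the
        # IP twice (latitude, longitude), take the distance to the delivery point, then
        # normalize using the mean and standard deviation learned from the training data.
        # The argument order for Distance/GeoIP is an assumption.
        return (
            Norm(Distance(
                GeoIP('visitor_ip_address', 'latitude'),
                GeoIP('visitor_ip_address', 'longitude'),
                'delivery_latitude',
                'delivery_longitude',
            )),
        )

    def get_output_encoder(self):
        # The label is already a 0/1 boolean in the database, so Pass is a no-op.
        return Pass('disputed')


# fraud_detection/models/disputes.py (hypothetical)
class Keras(lore.models.keras.Base):
    def __init__(self):
        super(Keras, self).__init__(
            pipeline=Holdout(),
            # Swapped from the scaffolded regression estimator to a binary classifier.
            estimator=lore.estimators.keras.BinaryClassifier(),
        )
```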
So that basically gets us our full model; that is a machine learning model. Now you can go back to your command line and run lore test to make sure you didn't make any typos. This builds the model and runs the SQL against the test database you've already set up, making sure your SQL actually executes and returns rows, and then it feeds those rows through. It's generally very quick as long as your test database is quick; hopefully you don't have eight billion rows in the test database on your laptop, but if you did, that would be a pretty exhaustive test. Those are two automatic smoke tests generated by the framework when you generate your model, and you can of course write as many unit tests as you want while thinking about what could break in your data, which is actually a very important practice.

The next step: okay, great, we've got a piece of software on my laptop. That's fantastic, but I want to deploy this code to a central server. Instacart uses a monorepo; all of our code is in the same GitHub repository for the entire organization, and all of our projects are just subfolders. In practice it works surprisingly well, because I can grep the code base for any string and see who's using any column in the database or any other feature I'm interested in. In this case, I check my code in, which creates an automatic deploy out to a staging server. I can SSH into that staging server, which has the real Redshift connection and everything it needs to run against production data. I've already tested locally, so I'm not afraid of it breaking. I can then call lore fit, which is another command, and pass it the name of the model we just created, this loss prevention model that predicts delivery disputes. It fires up and starts training on as many samples as it has. Here it's running against my local laptop database with a hundred samples, and it's going to validate on ten samples; another ten samples, invisible here, are being held out for later testing. TensorFlow is really cool because it gives you an ETA for how long each epoch is going to take, and you can see the loss, which is basically the difference between the predictions the model is making and the true answers it gets to see during training. You want this number to go down over time. As you let this run for some number of epochs (each pass through the full data is an epoch), you'll see the loss drop from a very high initial value and keep dropping with each epoch. It will almost always continue to go down, because over time the model starts to memorize more and more of the data it has seen multiple times; eventually it could just memorize the whole thing, and then you'd get zero loss and perfect predictions. The important part is the validation: the model then has to make predictions against data it has never been trained on. It has this second set of data and starts making predictions on it, and that's what we're actually looking at, that's what we're actually considering. The validation loss will very rarely be better than the training loss, and you'll see it come down over time, but eventually it starts going up, as your model wastes more and more capacity on memorization that doesn't generalize to things it has never seen before.
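For orientation, the testing and fitting steps just described look roughly like this. The command-line invocations are shown as comments because exact arguments vary by Lore version, and the module path uses the invented project name from the earlier sketches.

```python
# Command-line workflow (approximate):
#   $ lore test                                       # run the scaffolded smoke tests
#   $ lore fit fraud_detection.models.disputes.Keras  # train, early-stop on validation loss, save the fitting
#
# Roughly the same thing from Python, using the hypothetical classes sketched earlier:
from fraud_detection.models.disputes import Keras

model = Keras()
model.fit()  # trains on the training split and monitors loss on the validation split;
             # the fitted model (weights plus encoder state) is then serialized to the model store
```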
As soon as the validation loss starts going up, you want to stop training your model, because it's wasting time. Patience is a parameter you can pass during training: you can say wait five epochs, or zero, or however many you want, after the validation loss stops improving. Once you stop, the model is saved to the model store. The model store is configurable: it can be an S3 bucket, it can be your local hard drive, it can be Redis if that's where you like to store binary data.

Some important files are generated during this whole process. You have your requirements.txt and your runtime.txt. Runtime is your Python version, so that will just contain 3.6.4 or 2.7; Lore is compatible with either, because we have many old data science projects that nobody wants to port from 2.7 to 3.6, but we're doing everything in 3.6 now. requirements.txt is a completely frozen set of all the requirements the model was built with, all the way down to the version of the most deeply nested dependency, so you don't have to worry about whether somebody else who checks out your project will be able to run your model because, say, Flask upgraded some minor dependency that's no longer compatible and they don't know which version of Flask you were using. This solves a huge class of problems; it's definitely a best practice for us. You've got your database configuration, which can use environment variables to interpolate which connections should be available depending on the system you're running in. It's your typical Python config file in INI format, which specifies a name equals connection string, or you can go the more verbose route and spell out host, port, user, password, and so on under a name. There's also an AWS config file, which takes your IAM role and things like that if you use AWS services, which we rely on pretty heavily at Instacart.

You'll notice that when we cached the original get_data response, it created a file on disk, using a hash of the SQL file as a unique key for what the SQL was at that point in time, and it dumps the result into a pickle, so a later call comes back from the disk cache with a near-instantaneous result. Pickle is, surprisingly, one of the fastest Python serialization formats. It's not that surprising, since it's the native Python serialization, but you see all these benchmarks for HDF5 and other much more niche, configurable serialization formats; I should caveat that pickle in 3.6, with the latest protocol, is incredibly fast, basically doing a direct memory map to disk and back out again.

You'll also see that when we trained our model, when we called fit, it generated a fitting number of one. Every time you call fit, you get a new model serialized to disk representing that fitting, so you'll have a bunch of monotonically incrementing numbers here and a bunch of results files. Because this is a deep learning model it has weights, which are in that wonderful HDF5 format that I love so much, and you've also got the pickle, which has everything from your encoders and the other various bits of the model.
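The reason the encoder state rides along in that pickle is the statefulness discussed earlier: the mean and standard deviation learned by Norm at training time have to be reapplied, unchanged, to every future prediction. Below is a generic illustration of that idea in plain Python; it is not Lore's internal serialization format, and the numbers are made up.

```python
import pickle

# Fitted state for a Norm-style encoder, learned from the training data (values invented).
encoder_state = {"distance_km": {"mean": 42.7, "std": 113.9}}

# Persist it alongside the model weights for this fitting...
with open("fitting_1_encoders.pickle", "wb") as f:
    pickle.dump(encoder_state, f, protocol=pickle.HIGHEST_PROTOCOL)

# ...and reload it at inference time so new data gets exactly the same scaling.
with open("fitting_1_encoders.pickle", "rb") as f:
    restored = pickle.load(f)

def normalize(value, stats):
    return (value - stats["mean"]) / stats["std"]

print(normalize(10.0, restored["distance_km"]))
```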
We save a bunch of statistics about the test results and the parameters, the loss and how the model trained over its history, so that we can graph them later and visualize a bunch of characteristics. And of course we have logging. Logging is near and dear to my heart; it solves a lot of debugging problems, and it's amazing to me how many data scientists don't take the time to set up great logging for everything coming into their system and going out of it. In development this goes to your localhost; in production it goes to our logging provider, and then we have all of this searchable output. The cool thing is that the logging we get from training is configured exactly the same way as the logging we have running at inference time, so we can watch our training logs and our inference logs come in the same way.

Speaking of inference, that's the next slide. This is great: we've got a model and we've saved it to our hard drive or to S3 or somewhere. But the whole point of having a model is having a service that runs in production that the rest of your application can call to make predictions, so that when an order is created, the application can make an API call to this model, we can predict whether it's going to be fraudulent or not, and we can take some action in the application at runtime, in real time. Some models don't have real-time requirements; some models are batched nightly, so the only thing they do is call fit every night, then call predict across the whole database and dump all of the predictions back to the database, and that's it. For that you use Redshift, or you use Snowflake; we're using both in production now. But for this model we need to add a few more components so that we can do inference in real time.

Specifically, we go back to our extract. This is a newer feature we're playing with: Jinja2 templates in our SQL. At inference time we want to scope this query to a specific delivery. We don't want to pull the whole database, and we don't want to connect to Redshift, which could take minutes; we just want to pull the most recent, up-to-date data from Postgres. Since Redshift, Postgres, and Snowflake all have very similar dialects, this works surprisingly well in practice. We create our Jinja2 template, which has this extra predicate on the end. We take the original get_data from our pipeline and expand it so that it now takes a delivery ID parameter, and if there's a delivery ID, we pass it to the template. We also want to use Postgres as our data store of choice in that case, because now we're pulling back a single row, whereas previously we were pulling back a lot of data from Redshift. So if you pass a delivery ID to this get_data function, we expect millisecond response times; if you don't, we expect something like a 30-minute runtime for the Redshift query to return all the data. There's a little temporary variable to hold the template interpolation, and then we pass that to the connection, either Redshift or Postgres, using the same dataframe call, except this time we pass raw SQL instead of a file name. There are lots more options you can pass here, and you can always pass interpolation variables directly to the SQL, and they get interpolated safely whether they're strings or numbers.
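A sketch of what that expanded get_data might look like. The Jinja2 predicate, the bind-parameter style, and the connection attributes are my approximation of what the talk describes rather than verified Lore code; the exact way Lore exposes templating and query parameters may differ by version.

```python
# Continuing the hypothetical fraud_detection/pipelines/disputes.py
import jinja2
import lore.io
import lore.pipelines.holdout

TEMPLATE = jinja2.Template("""
SELECT
    deliveries.visitor_ip_address,
    deliveries.delivery_latitude,
    deliveries.delivery_longitude,
    (credit_card_disputes.id IS NOT NULL) AS disputed
FROM deliveries
LEFT JOIN credit_card_disputes
    ON credit_card_disputes.delivery_id = deliveries.id
{% if delivery_id %}
WHERE deliveries.id = %(delivery_id)s  -- bound as a query parameter, not string-interpolated
{% endif %}
""")


class Holdout(lore.pipelines.holdout.Base):
    def get_data(self, delivery_id=None):
        sql = TEMPLATE.render(delivery_id=delivery_id)
        if delivery_id:
            # One recent row: hit the operational Postgres replica (milliseconds).
            return lore.io.postgres.dataframe(sql=sql, delivery_id=delivery_id)
        # Full history for training: hit the Redshift warehouse (minutes) and cache the result.
        return lore.io.redshift.dataframe(sql=sql, cache=True)
```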
We need to make one final modification to our model. It has a predict method provided by default, but now we want to predict based on a delivery ID rather than the default, which is a full data frame. Typically we pass in a full data frame of all the data, so at inference time the model might not even have a database connection: we just provide it with all of the data over the wire, it takes that data in as a data frame, runs it through the estimator, and returns a prediction. In this case, we're going to use the pipeline to fetch the data for just that delivery ID from the database, and it uses an identical query, which ensures we don't have feature mismatch between training and prediction time. That's a really hard problem to solve sometimes. You can't always do this; sometimes it's just not feasible, because you need fact tables in one place that you don't have in the other. But when you can do it, it follows the keep-it-simple-stupid principle pretty well.

Once you've done that and added your delivery ID parameter at runtime, you can fire up a lore server, which is just a little Flask app. It automatically generates endpoints for all of your models; you can call predict on them, pass whatever parameters you like, and it gives you an answer. At this point you're pretty much ready to go into production with your model.

This is a pretty tight experimentation loop. Once I've got something like this running, I can keep adding features; I think our model is up to 160 features in production now to detect fraud. Without a system like this, where I can continuously rerun my query, continuously evaluate performance, and launch new instances into test and training, it takes a lot longer to iterate: make one SQL change, wait two days, that didn't work, make another SQL change, wait two more days. Having continuous integration, continuous deployment, and cloud scalability for multiple branches and scenarios before I actually choose one to commit to master, push, and deploy has greatly accelerated our development time for machine learning models at Instacart. That's basically the finalized picture of everything; I think I've covered most of the boxes there.
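Before moving on, here is a sketch of that last modification. The method signature and super() call are my guesses at how Lore wires this up, and the serving notes in the comments are only indicative: the lore server generates its endpoints automatically, and the exact URL scheme depends on the Lore version.

```python
# Continuing the hypothetical fraud_detection/models/disputes.py
import lore.models.keras
import lore.estimators.keras
from fraud_detection.pipelines.disputes import Holdout


class Keras(lore.models.keras.Base):
    def __init__(self):
        super(Keras, self).__init__(
            pipeline=Holdout(),
            estimator=lore.estimators.keras.BinaryClassifier(),
        )

    def predict(self, delivery_id):
        # Reuse the pipeline's templated query so training and inference build
        # identically constructed features for this one delivery.
        data = self.pipeline.get_data(delivery_id=delivery_id)
        return super(Keras, self).predict(data)

# Serving (approximate):
#   $ lore server   # starts a small Flask app with auto-generated predict endpoints
# The application then calls the model's endpoint (over HTTP here, or over Instacart's
# internal RabbitMQ-based RPC) with a delivery ID and gets back a fraud score.
```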
Just to go over a few of the transformers we've built in-house that we use a lot: GeoIP is cool. It uses MaxMind's database, so you get all kinds of geographic data about anybody visiting your site, and you can calculate distances like we did earlier. Date/time and string manipulations are critical; we use those all the time to generate features from whatever happens to be stored in the database. We can build more interesting transformers too. Say I want to extract the domain from an email: you can have a fairly complicated or fairly simple regex to do that (I think the official one is three pages long or something). Or extracting the area code from a phone number. These all make good features for machine learning models, but if you don't have them easily accessible, it's kind of a pain to extract an area code from a free-text phone number that may be internationalized, may have parentheses, may not. So a lot of the time data scientists will skip these kinds of minor feature engineering improvements, which is unfortunate, and that's why we like having a big library. We also use US census data, so we can statistically predict age or sex from a name, or recognize that a name like "Mom" in a phone book is a pretty good indication that you're closely related to that person. There's just a ton of information out there, and when we have a full library we can do much more advanced feature engineering much more cheaply.

Encoders are the key building block I told you about: they're the stateful representations we use to do feature engineering. I heard somebody ask earlier about one-hot encoding. That's interesting, because for some machine learning algorithms you have to one-hot encode categorical variables, while others, like Keras, have a built-in ability to handle them in a much more performant way on the GPU. So we have a Unique encoder that will take as many categoricals as you need and, depending on whether you're using a Keras estimator, an XGBoost estimator, or scikit-learn, whatever your machine learning model actually needs, it will one-hot encode and expand into that sparse format if required; if it's not required, you've saved a lot of computation. There are also things like the GloVe encoder. I don't know if you know about GloVe word embeddings, but these are the ones where if you take the word "king", subtract the word "man", and add the word "woman", you end up with the word "queen". We can transform all of the words in your input into their GloVe embeddings, which can be really useful when you're dealing with things like product names. We support everything in scikit-learn, XGBoost, Keras, and TensorFlow right now. We're open to adding more libraries, but these are the primary toolkits our data scientists and machine learning engineers use at Instacart.
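To show the flavor of that kind of feature engineering, here is a generic Python sketch of two of the examples mentioned, extracting an email domain and a phone area code. This is not Lore's transformer API (whose class names and signatures I have not verified); it is just the underlying idea in plain pandas and regular expressions.

```python
import re
import pandas as pd

contacts = pd.DataFrame({
    "email": ["alice@example.com", "bob@Mail.Example.co.uk", None],
    "phone": ["+1 (415) 555-0100", "415-555-0199", "not a number"],
})


def email_domain(email):
    """Crude domain extraction; a fully RFC-compliant email regex is far more involved."""
    if not isinstance(email, str) or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()


def area_code(phone):
    """Pull a North American area code out of free-text, possibly internationalized, input."""
    if not isinstance(phone, str):
        return None
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):  # strip a leading country code
        digits = digits[1:]
    return digits[:3] if len(digits) == 10 else None


contacts["email_domain"] = contacts["email"].map(email_domain)
contacts["area_code"] = contacts["phone"].map(area_code)
print(contacts)
```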
That's pretty much it for me. Thank you guys very much. Any questions for Montana?

Q: Thanks, that was really cool. Do you do anything around handling horizontal scaling once you deploy to production? And what about A/B testing of your models, to make sure your new models don't perform worse than your old models?

A: One of the things Lore covers is that every model has a predict, an evaluate, and a score function. The score function is something you're expected to override, and if your score is lower than the previous training run, the new fit will not be auto-deployed to production, so there are guards like that. Typically, right now, when a data scientist is doing hyperparameter search (another thing we support), it's not fully parallelized yet; that's on our roadmap, so that when you want to test a set of hyperparameters you can fire up 50 boxes, run all 50 concurrently, and collect the results back, but that's not ready yet. We do have support for multi-GPU in Keras, which is available. It's difficult to say it always provides a linear speed-up with GPUs; the truth is it depends on the exact characteristics of your data set and your model, but there are two different strategies we support, and you can toggle them with flags.

Q: I see a software-engineering-skewed workflow; it was clearly designed by someone who knows Ruby on Rails, with scaffolding, unit testing built in, and so on. I was wondering how that ends up interacting with the data scientists in your organization, who in my experience are often more resistant to frameworks like that and won't work outside integrated environments like RStudio or Jupyter notebooks.

A: You're exactly right. Jupyter notebooks are a big part of our data science workflows, and they're fully supported in Lore. A big part of it is that as soon as you type lore install, it will pull down all the packages, build you a virtualenv with the right Python, and install that into your Jupyter notebook, so you can fire up your notebook and have the data science workflow you want to have. At the end of the day, you're expected to take the results from your Jupyter notebook and copy and paste them into these cells once you've finalized them. This is something we've worked on a lot with our data scientists. I'm a software engineer at heart, going back many more years, but I think that as long as we let them open up the black box, and give them these windows where they can completely replace an estimator with a custom TensorFlow or PyTorch implementation (I guess that's the hot one now; I got a feature request last week, so that's cool, and we're happy to do it), it's pretty easy and pretty seamless. All of this ends up in a very standardized shape as long as you conform to a few key milestones, and our data scientists have been like, "fine, I'll do that for you."

Q: Kind of a follow-up to the last question: the encoders idea is really cool. Do you have any tools or tricks to speed up the extracts part?

A: I am probably the weakest data engineer in this room; I write my Redshift query and I expect that it works. We have an amazing data engineering team who could probably give you those tips and tricks, like use a dist key or make sure you have your sort key right, but for the most part we write a lot of SQL, we tune it by hand, and we test it against our systems. We have experts in-house, and because we're in a monolithic repository, every SQL statement issued against a database is tagged with the line of code it's coming from, so our database monitors will flag high-CPU queries, and then you'll work with an expert in that system on what you're actually trying to do and how to do it better. Sometimes the answer is that you have to do a whole bunch more engineering: you have to put the data in Redis, you have to pre-calculate features. That's hard and it takes a lot of work, so we try not to do it; we try to do the simplest thing that can work, which is to have really, really large Redis or Snowflake deployments.

Q: How do you ensure that the data you use for training from Redshift and the data you use for scoring from Postgres are identical? And how do you ensure that the feature encoders you use for training and the ones you use for scoring are exactly the same, given that you use two different languages, Python for training and probably Ruby for scoring?

A: To the second part: all of the scoring happens in Python, using the same encoders. It's all one thing; it's pickled, it's unpickled, it never changes. That's also why the Flask server gives us an HTTP API. We actually have our own RPC service that we use internally, which goes over RabbitMQ, but Flask lets us make this a nice external open-source project. Your first question, can you say that one more time? Okay. What typically happens is that our Postgres database is replicated to many read replicas, which is our first line of attack, and that then gets replicated directly to Redshift or Snowflake, so we try to use identical tables that have one-to-one replication. That can be dangerous because of what you may have historically in Redshift.
If something gets updated after the fact, say we nullify the visitor IP address whenever there's fraud, the model will look at all of those fraud examples, see null IP addresses, and think, "oh, this is so easy," but at prediction time that signal won't look the same. That's something you just have to be careful about: you have to do the data science diligence, and typically it's really obvious. When somebody comes to me and says, "I can detect 99.99% of fraud," my reaction is, "well, maybe we should verify this a little more before we roll it out to production." Everything we roll out gets A/B tested; that's the ultimate answer for checking the quality of something. It doesn't matter how good your loss looks on any graph if you're not actually impacting business metrics, so at the end of the day we roll it out in an A/B test and see if it works. We have blog posts about this; it's very hard to A/B test logistical businesses that involve large integer programming problems, where you have to come up with a universally optimal solution, because you can't solve half the universe at once. So that's interesting, but we generally scale our A/B tests up from about 1% exposure to 10% over time and roll out that way.

Q: Cool system. I'm curious about the speed: what's the fastest you can deliver predictions, given that you're serializing pipelines? Do they get deserialized and kept in memory?

A: The timers and the loggers were actually the biggest bottleneck on speed, until we measured our timers with more timers. I have a 300-line query that I run against our Postgres read replicas, and it takes about 22 milliseconds to aggregate all of the data for a single fraud query. That's the average; if a user has a long history it can take up to a second. We're lucky, because we can implement a timeout there: users with a long history are much less likely to be committing fraud, so there are some cheats you can play with. For deep learning, inference can be super fast, again in the milliseconds. I think 50 milliseconds is the prediction speed of a model with about 10 million parameters for this particular problem, and that's actually an ensemble with XGBoost, which has a few hundred trees. All in all, these are well below the one-second cutoff we typically reserve for RPC calls made during a customer transaction.

As for our RPC setup: it's not common, and I think one day we would like to open-source it. It's a Ruby and Python RPC service that works over RabbitMQ. It does both pub/sub and direct calls: you enqueue your call into Rabbit, and you can either have listeners that listen for that published event, or consumers that consume the event and post a reply back onto Rabbit, which is then consumed in reverse to complete the RPC call. So it's a nice, universal...
Info
Channel: Data Council
Views: 1,468
Rating: 5 out of 5
Keywords: machine learning, instacart, instacart data science, montana low instacart, machine learning development to production
Id: J8OJSN6l_PI
Length: 48min 37sec (2917 seconds)
Published: Tue May 29 2018