Building Blocks of Machine Learning at LEGO with Francesc Joan Riera - #533

Captions
Sam: All right everyone, I am here with Francesc Riera. Francesc is an applied machine learning engineer at the LEGO Group. Francesc, welcome to the TWIML AI Podcast.

Francesc: Thanks a lot for having me, Sam. It's a pleasure, and I'm super excited about the talk tonight as well.

Sam: Same here, same here. And thanks for taking the call at night; it's a bit later for you. You're in Denmark?

Francesc: Yeah, that's correct. As you know, LEGO was born in the small town of Billund here in Denmark, and I live just across from the headquarters.

Sam: Very nice. Why don't we get started by having you share a little about your background and how you came to work in machine learning, and at LEGO?

Francesc: It's not going to be a very long story, because I've only been on the market for roughly three years. My enthusiasm for ML, or really for computer vision, started during my bachelor's in industrial electronics in Spain. The last semester focused on robotics, and in robotics we had an introduction to computer vision. I really don't know why, but I thought it was damn impressive what you can do with mathematics and matrices, with pictures and pixel values and all those sorts of things. That drove me to a master's in Denmark, so I moved here to do a full-on master's in computer vision and machine learning at Aalborg University, in the north of Denmark. After that, I guess I became sort of an expert in the area; I hope so, at least. I had a short job as a software engineer for a couple of months, which was not my thing, and then I found this role where I got to actually work with ML on active products running on the cloud.

Sam: Awesome. Speaking of industrial robotics and computer vision, one of the early demos of AI, many years ago I think, was someone who built a LEGO sorter using a treadmill kind of thing and a paddle, or some kind of robot, that would sort the LEGO into different pieces. Have you ever seen that one?

Francesc: It's funny you mention it. We are trying to dig that one up from the internet, maybe to explore the option of using it as a product for us.

Sam: Oh really?

Francesc: So it's quite funny that you mention it; I think we talked about it two or three weeks ago.

Sam: That's funny. If I remember correctly, I read a Hacker News thread when this was published, probably four or five years ago, and there were apparently a bunch of people who would go on eBay, buy these gigantic bags of miscellaneous LEGO, and talk about using these kinds of devices to sort them and then resell them. Apparently there's a bit of a cottage market, so to speak, in remarketing LEGO.

Francesc: Yeah, and it ties into one of the big campaigns we're running: even though LEGO is primarily made of plastic, you want to give your bricks the longest life possible. If you had a LEGO set from the '50s or '60s, you could still probably use it today. I think that's what drives the enthusiasm for making it a circular economy, in a way, for the brick.

Sam: Well, I brought that up not thinking it was something you were actually working on, but maybe we can jump into how ML is used today at LEGO.

Francesc: Maybe I can start by presenting my area and my team, going from, I guess, the bottom to the top. I'm part of a
team of three ML engineers plus a full-stack developer, and then all the surroundings. In my team we have two main products, which I can quickly describe now; we can go into details later if you want.

The main product, which started about two and a half years ago, is a moderation service. As you know, LEGO is a brand aimed at children, and children are our main responsibility. In all of these social media applications or games, anywhere children can upload content they create on the phone or computer, whether images, text, or videos, that content needs to be moderated by law before it's actually available online for other children to see, to avoid the obvious things. We know Facebook does this, we know Instagram does this, a lot of companies do it, but not at the level LEGO does. On Facebook you could post something obscene and it's not necessarily deleted as you upload it; it gets deleted after the fact, maybe because somebody reported you. That cannot happen in our applications, because the damage could already be inflicted, and that's definitely not the image you want to give the brand.

So our product receives the content created by the users and does an initial pre-filtering of it, for images, text, and videos. We try to reject up front the most obvious things that should not be in the app: a very bad-quality image, or a photo showing your face or some other identifiable information, which you cannot share either, and of course the obvious profanities. We're also working on extending a couple more detectors, and by detectors I mean machine learning algorithms. What gets rejected by us just gets fed back to you: "We're very sorry, but your creation is not allowed because of this; try to take a new picture," or "try rewording the text in a different way." If the content is approved by us, it then goes to manual moderation: there's a company moderating every single piece of content before it is published.

What we accomplish with our product is two things. One is the monetary aspect, because each piece of content that is obviously rejectable doesn't need to go to a moderator, and moderation costs per piece. The second is responsiveness. Say there's a big fair with LEGO and people are asked to upload pictures: you build a car and then post a picture of the car. If the picture was bad for some reason, it's nice to get the feedback instantly rather than having to wait. When content goes to manual moderation there's a time frame of five minutes, or maybe five to fifteen; I'm not 100% sure, because it's not my part. Our service, though, provides a response in under 10 seconds for any type of content. So if you take a selfie with your car, you get "we really like your car, but you cannot have your face there" within seconds, you take another picture, and life's good again.

The second product kicks in after everything has been approved, when your creation is live on the social media app. We have an automatic job every morning that takes all the creations from the users and, with two sorts of algorithms, one for images and one for text, tries to identify the predominant theme of each creation. For example, you're a big Star Wars fan, you have a lot of Star Wars sets, you take a picture of your Star Wars set, and we'll probably identify it as Star Wars. So this kid has uploaded a Star Wars picture or a Star Wars album collection, and we send a like from the NPCs on the application, and we also use different NPCs to send comments reinforcing and encouraging them to keep building: "That's a very nice Star Wars creation, I would love to see more from you," and similar.

Sam: And the NPCs are bots, or characters?

Francesc: Yeah, they already existed in the app, and before we set up this product they were operated by people. A consumer-engagement person on the team would log in as, let's say, Chewbacca, and comment on a couple of things. What we're doing now is automating that job. There are of course still people looking at the comments and maybe making more specific comments depending on the scenario, but it's proven to give a lot of value, because we've received a lot of uploads saying "look, Emmet liked my post!" and they are super happy. It's very fulfilling for us to see that the reactions are positive.

Sam: You mentioned there are three ML engineers on that team. What's the broader size or scope of ML and data at LEGO?

Francesc: Back in March we had a big reorganization, where we founded what is now called the data office. My team resides within the data office, alongside the other data science product teams. For example, on the lego.com page, where you go to the shop and buy a Star Wars set, we have a recommender engine: "you like the Death Star, maybe you're also interested in the X-wing." So the recommender team is, I guess, our sister team, also using data and developing ML solutions. There are two other product teams as well: one working on marketing effectiveness, for example how a new release of LEGO Ninjago on Netflix performs compared to whatever other cartoon is out for children at the time, and the other on demand forecasting, trying to stay ahead of the curve. I guess that team had a lot of fun in corona times: there was no curve, it was just a peak, and we were running out of stock everywhere. And then we have an incubation team, which analyzes different areas, sectors, and departments at LEGO where automation could be very beneficial, for example using machine learning. One of the examples is this brick sorter with a treadmill, at a bit larger scale, but that's the idea.

Sam: Nice. The moderation app reminds me of a recent interview we published with someone who works in cybersecurity. In that environment, as you can imagine, it's very adversarial: you plug one hole, and someone's trying to poke another. In moderation, in your environment, are there bad actors trying to abuse the system, or is the task primarily identifying passive behavior that you don't want?

Francesc: There's a bit of both, actually. Generally speaking, we see cases where maybe you had the phone in your pocket and accidentally took a picture, which happens to everybody, and then you post the picture by mistake. That happens quite often. Sometimes we actually caught
some bugs in the app that way: "Is it normal we're getting a thousand black images, or full-white images? That makes no sense; maybe you should look into the latest release you made." That was a funny side effect. But the bad-actor side is a bit beyond where our product goes; that's more on the manual moderation side. They do track behaviors and bad actors, and as far as I know there have been a couple of cases where somebody got banned, opened more accounts, kept doing the same, got banned again; you know the game. So there's a little bit of that.

Sam: I was curious whether that was something you had to deal with, and how it played into the way you approach model development, but it sounds like that's downstream of what you're doing.

Francesc: It is. Again, we are not Facebook or Instagram; we need to make sure nothing bad gets published. That's why there are the extra layers of security.

Sam: Do you incorporate the input of the human moderators? Is it a human-in-the-loop type of situation, where you're using their judgments to evolve your models?

Francesc: Yeah, that's right, and it's something we started doing at the beginning of this year, maybe the end of last year; with the corona year everything is a bit distorted in my calendar. What we managed to link up was this external moderation company: when they reject an image, a text, or a video, we get the feedback saying "this content was rejected because of this." And what we built is a feature store. There's a very simple reason for making it a feature store rather than just a database: by law, under GDPR, the data cannot be stored forever. Let's say you are a user of the social media app, and one day you call consumer service and say, "I want all my data deleted." We are, in a way, a standalone product; we are not linked to the app. So if you called consumer service, your data in the app gets deleted, but nobody is going to tell us this user called. First of all, I don't have your user ID, for example, so it would be impossible for me to track who you are. What we do instead is generate features, and the features are generated by well-known machine learning algorithms: for images we're talking ResNet-50, for text samples multilingual BERT, for example. We take the images and text samples and extract the features, which are anonymized, because if you have pooling layers in a neural network, you cannot undo the pooling; well, you could try, but you would not get a one-to-one match. Those features are then saved and labeled with the human in the loop.

Sam: Got it. So rather than saving the original images, you're saving these representations from the neural nets, and one of the driving reasons is to avoid the GDPR responsibilities around personally identifiable data?

Francesc: Yeah. And whether it's personally identifiable or not, if the user calls and wants the data deleted, it needs to be deleted; otherwise you're breaching the contract, or I guess the GDPR.

Sam: Meaning, if they call whatever the consumer GDPR line is and ask for their data to be deleted, these representations get deleted from the feature store?

Francesc: Not the representations, because they are anonymized data. It's fully anonymized; there is no way to backtrack who or where it comes from.
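The anonymization property Francesc describes comes from the pooling at the end of the backbone. Here is a minimal sketch, assuming a ResNet-50-style global average pool over a 7x7x2048 activation map; the shapes and random data are illustrative, not LEGO's actual configuration:

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse an (H, W, C) activation map to a C-dim vector, as the
    pooling layer at the end of a ResNet-style backbone does."""
    return feature_map.mean(axis=(0, 1))

rng = np.random.default_rng(0)
a = rng.normal(size=(7, 7, 2048))

# Shuffling the spatial positions changes the input but not the pooled
# feature: many distinct activation maps collapse onto the same 2048-d
# vector, so the stored feature cannot be mapped back to one image.
b = a.reshape(-1, 2048)
b = b[rng.permutation(b.shape[0])].reshape(7, 7, 2048)

fa, fb = global_average_pool(a), global_average_pool(b)
print(np.allclose(fa, fb))  # True: same feature from two different inputs
```

Because many distinct inputs pool to the same vector, the mapping is many-to-one and cannot be inverted to a specific image; this is the sense in which the stored features are "fully anonymized."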
Sam: You described that as a rationale for a feature store versus a database, but you could still put those representations in a database.

Francesc: Well, yeah, I guess we like a fancy word, so I call it a feature store rather than a database.

Sam: I say that to probe a little deeper: are there other capabilities you've built into this feature store that are specific to the way you're using it for machine learning?

Francesc: Yes. As I said, the features, which are anonymized data, get labeled by moderators, so we have labels for the moderation part, and now we also have labels for the theme part. And as I said before, the images that are approved are published on the application, where they are open to everybody, because they come from a backend that publicly shares them. For those, rather than keeping the features (that model is a bit more complex), we keep a pointer to where the image is stored in the application's database, and we reuse that. If the image was later deleted by the customer, well, we don't have the image; it's a pity, but at least it's not our problem anymore. What we're also building on top of this feature store, hopefully done by the end of the year, is a retraining pipeline plus an A/B testing framework. The idea is checking the feature store every now and then: do we have enough features for retraining, say, the person detector? All right, then let's retrain. And before we launch it live, even though we think the new model is better, let's run A/B testing where both models are in production at the same time, and then it's up to us as developers, but also the business side of moderation, to decide whether the new model really is better.
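That "check the store, retrain when enough new labels have accumulated" loop can be sketched in a few lines. Everything here, the threshold, the detector names, and the data shapes, is hypothetical; it only illustrates the trigger logic:

```python
from collections import Counter

# Hypothetical threshold: retrain a detector once this many newly
# labelled features have accumulated since its last training run.
RETRAIN_THRESHOLD = 5_000

def detectors_due_for_retraining(labelled_features, last_trained_counts):
    """labelled_features: iterable of (detector_name, label) pairs pulled
    from the feature store. last_trained_counts: how many labelled rows
    each detector had when it was last trained."""
    counts = Counter(name for name, _ in labelled_features)
    return sorted(
        name for name, total in counts.items()
        if total - last_trained_counts.get(name, 0) >= RETRAIN_THRESHOLD
    )

# Toy data: 6,000 person labels (5,500 new), 2,000 profanity labels.
rows = [("person", 1)] * 6_000 + [("profanity", 0)] * 2_000
due = detectors_due_for_retraining(rows, {"person": 500, "profanity": 0})
print(due)  # ['person']
```

A scheduled job running this check is all that is needed to kick off a retraining pipeline for the detectors that cross the threshold.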
Sam: I'm curious how you're packaging and deploying the models.

Francesc: For model experimentation and model registry we use the open-source version of MLflow, the package from Databricks. We created our own model store with the MLflow backend, and within the model registry we have the models in either production, staging, or neither. The core of moderation is built on a Step Function in AWS. In this Step Function I have, say, the person detector, and if I have a staging model for the person detector, then the task becomes a parallel task where the inputs flow to both, so I can have a one-to-one analysis of which model is performing better on all the images.

Sam: You mentioned Step Functions; do you use serverless technologies pretty broadly in deploying ML?

Francesc: Yes, and that's been one of the biggest changes this year. We're full-on serverless: everything is deployed with AWS SAM, for example, and everything is a Step Function or a Lambda. We also have an ECS Fargate task running for the biggest model, but generally you could say everything is serverless.

Sam: Interesting. It sounds like you've evolved the infrastructure quite a bit over the past year or so. Can you share what the prior state was, and what some of the drivers were for moving to serverless and container services?

Francesc: As I said, the moderation product started two and a half years ago, and when we started, we thought it would be best to manage everything ourselves, which meant going full-on Lambdas, and rather than Step Function logic, SQS, SNS, and DynamoDB with streams.
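The parallel Step Functions branch can be mimicked in plain Python to show the idea: when a staging model is registered alongside production, every input is scored by both, and the paired verdicts allow a one-to-one comparison while only the production verdict is enforced. The registry layout and toy models below are hypothetical stand-ins, not LEGO's actual detectors:

```python
def run_moderation_step(content, registry):
    """Mimic the parallel branch: score content with the production model
    and, if present, the staging model; only the production verdict is
    acted on, the staging verdict is logged for comparison."""
    production = registry["production"]
    staging = registry.get("staging")
    result = {"verdict": production(content)}
    if staging is not None:
        result["staging_verdict"] = staging(content)  # logged, not enforced
    return result

# Toy "models": keyword rules standing in for the real detectors.
registry = {
    "production": lambda text: "reject" if "face" in text else "approve",
    "staging": lambda text: "reject" if "selfie" in text else "approve",
}
print(run_moderation_step("my selfie with the car", registry))
# {'verdict': 'approve', 'staging_verdict': 'reject'}
```

Aggregating the paired verdicts over live traffic gives exactly the "which model is performing better on all of the images" analysis described, without ever exposing users to the unproven model.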
If I had a picture of the old setup, I think you would be scared, and many people listening would be scared too. That evolved into going with ECS, full-on Fargate tasks. The problem in that scenario: the moderation exists in 26 markets, and we get data constantly, all day, every day. It makes no sense to have a Fargate task that is stopped and then has to be started when new data comes, because data arrives maybe every two or three seconds, which meant the Fargate tasks were running 24 hours a day. And for the Fargate task, which is basically a Docker container, to receive whatever needed to be moderated, we needed a queue, and at some point a thread looping over it would get the message. It worked well, everything was fine, it was not a big deal. But whenever we got new colleagues in my team, it was hard to explain the flow, hard to understand what was happening. Then, in December 2020, AWS, or Lambda specifically, came with an update where you can deploy your custom Docker images to Lambda, and that made our work much easier; life was better that day. We had the Docker images already, so we just moved them to Lambda, because we know how Lambda works. Lambda is amazing; not sponsored by AWS, I should say. With the Docker containers running in Lambda, we could integrate them into the Step Functions without task tokens waiting for callbacks and a lot of complications. Effectively, what this meant, and still means today, is that we can deliver responses in under 10 seconds, even for many images with comments and everything, whereas before we were around a minute. We managed to reduce the average time from a minute to 10 seconds, which is quite the accomplishment, I think.

Sam: Do you run into runtime constraints using Lambda when you're doing image inference?

Francesc: No, and that's because, looking back at the feature store again, the images are passed through a ResNet-50 without the classification layer, which is a very common approach.

Sam: Right, transfer learning. You've already got the representation, and you're just doing classification.

Francesc: Exactly. And it's funny: there was this Tesla AI event a couple of weeks or months ago where they mentioned they are doing image classification with something called HydraNet. And what is HydraNet? It's just a ResNet where you have different heads that classify different things. And I thought, well, this is exactly what we do.

Sam: Yeah, why didn't we think of a cool name like that? [Laughter] We've talked quite a bit about the feature store. Can you talk a little about how it evolved, or any challenges you ran into in bringing that project to fruition?

Francesc: I think one of the biggest gaps was from idea to reality. Of course a feature store makes a lot of sense: if you want your models to improve, you need data, and you need labeled data. I guess that is the suffering of every ML engineer: it's not that there's no data, there's plenty of data in the world, but it's not labeled. So the ideation was there, but it took us quite a period of time to figure out the best approach, also thinking about the future.
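The HydraNet-style setup Francesc compares to his own, a single shared backbone feature with several small heads, can be sketched with NumPy. The head names, output sizes, and random weights here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(z):
    z = z - z.max()  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

# One shared 2048-d backbone feature (e.g. a ResNet-50 without its
# classification layer) with independent linear heads on top of it.
heads = {
    "person":    rng.normal(size=(2048, 2)),   # person / no person
    "profanity": rng.normal(size=(2048, 2)),
    "theme":     rng.normal(size=(2048, 13)),  # 13 themes, as in the text
}

feature = rng.normal(size=2048)  # stand-in for one pooled image feature
predictions = {name: int(np.argmax(softmax(feature @ W)))
               for name, W in heads.items()}
print(predictions)
```

The backbone runs once per image, and each cheap head reuses the same feature, which is also why the stored feature-store vectors are enough to retrain any individual detector.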
I mean, I could just get all the data from the manual moderation, stack it up on some OneDrive, and let it sit there forever, which is what a lot of people and a lot of companies do with their data, I'm afraid to say. But then we came to the realization, I think partly from reading posts on how other companies did it, maybe Uber actually, on how they did feature stores at Uber, to build a feature store client. We wrote everything in Python, with AWS as the backend, because we are full-on AWS, but we tried to make it cloud-agnostic, so that if one day we move to Azure or anywhere else, there aren't that many problems integrating it.

One of the learning points from the feature store's first steps: a feature, say an image that went through ResNet-50, goes from an image of a couple of megabytes to a seven-kilobyte feature array. I know because I've seen many of them. On average we get around 10 to 15 thousand images per day on the platform, so 10 to 15 thousand features per day, and we thought: let's store them in S3. So we stored them in S3, with a catalog in a NoSQL database, DynamoDB on AWS. Everything was fine, the data was arriving, the numbers were increasing, everything looked wonderful, until the day we wanted to query the data for the first time. It turns out S3 is not the ideal scenario when you want to query a lot of small files, because, and I'm not an expert in this sort of infrastructure, there are all these handshakes between requesting and getting the data, and they were taking longer than fetching the data itself.

Sam: Why was the querying against S3 and not DynamoDB?

Francesc: DynamoDB was a catalog: an entry would say "this is the feature path on S3, and this is the label," and then you get the data from S3, because in our eyes that's where data is supposed to be stored, in the bucket rather than in a table.

Sam: So the problem you ran into was just the latency of requesting a single file in S3 when your bucket had a lot of files?

Francesc: Exactly, when requesting a big dataset, a big list of features. For a couple of hundred features it doesn't matter, but one of my colleagues did the calculation, and it was something like 40 milliseconds per feature; I'm not exactly sure of the number now. What we learned is that we should probably do something else. So we migrated the data itself into DynamoDB as well: a feature can be converted to a bytes array, and the bytes array can be stored as a NoSQL entry. By moving the data into the catalog itself, the data is now fetched at around 16 milliseconds per feature, less than half the time, and that has proved to be very efficient.
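The bytes-array trick is straightforward: a float32 vector serializes to 4 bytes per dimension, so a 2048-dimensional ResNet-50 feature becomes an 8 KB binary blob, in the same ballpark as the seven-kilobyte figure quoted, which fits comfortably inside a DynamoDB item (the item size limit is 400 KB). The helper names below are hypothetical:

```python
import numpy as np

def feature_to_item(feature, label):
    """Pack a float32 feature vector into the catalog entry itself, as a
    binary attribute, instead of storing only a pointer to an S3 object."""
    return {"label": label, "feature": feature.astype(np.float32).tobytes()}

def item_to_feature(item):
    """Restore the vector from the stored bytes."""
    return np.frombuffer(item["feature"], dtype=np.float32)

vec = np.random.default_rng(1).normal(size=2048).astype(np.float32)
item = feature_to_item(vec, "approved")
restored = item_to_feature(item)
print(restored.shape, len(item["feature"]))  # (2048,) 8192
```

Fetching the feature in the same read as its catalog entry removes the per-object S3 round trip, which is where the 40 ms to 16 ms improvement came from.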
Sam: Is the data community at LEGO full-stack, or do you have your more traditional data scientists and then folks that are more ML engineers? What's the range of skill sets, or the culture, in terms of full-stackness, if that's a thing?

Francesc: On full-stack, we're very happy to have a full-stack developer in our team, because he is a mastermind at our front-end tooling; before we had him, our front-end app was maybe something out of a university project, you could say. But we are split: we have machine learning and data engineers, like myself; we also have the more standard data scientists; and then we have data management folks and platform people, the ones helping with the more standard infrastructure, like setting up the accounts, the security, and all the basics you need as an enterprise company.

Sam: Interesting. We haven't talked in detail about the engagement tool. What were some of the interesting challenges you ran into in developing that?

Francesc: There have been a couple. When you come fresh out of university, you think datasets are beautiful and clean; I think that's what a lot of people think, because when you read research papers the data is just beautiful, and you have 16 GPUs and apparently all the money in the world. Well, that's not our case, even though things also go well, monetarily speaking, at LEGO. The whole idea for this engagement tool started maybe two and a half or three years ago, and the initial idea was to get the data to train, starting with images only, an image classifier that could recognize themes. The constraint then was that the algorithm had to run on-device. We went with the smallest network we knew of at the time, only a few megabytes, and that was still too big for them. So we had a problem: we could try to go smaller, reduce weights between layers, do all these optimizations, but the results were already a bit clumsy. You might have a Jurassic World build and it would be recognized as Star Wars. It was far from ideal, and the project actually stopped there, until a couple of months later somebody said: why don't we try this theme detection for interacting with the customers, with the users? Because then, they said, you have room to use whatever network you want; we don't really care, because it's just going to run in the cloud, after the fact, at some point in the day, to get the themes and send some likes and comments.

And that went very well. We started with five themes, the top five categories on the application. Probably the biggest issue, again, was data quality. You can think of our application, the LEGO Life app, like Instagram: as a user you take a picture of your LEGO build, write a title and description, "this is my cool creation, this is my Star Wars set," and you can also add hashtags, like on Instagram. Those hashtags could be #starwars and a lot of other things. Because this data is available in the social media's backend as this, you could say, clumsy labeling, we said: in a first iteration, let's try to crawl this data using specific themes, and maybe even keywords we know are used for specific things. For example, if a user has published an image and says "this is Chewbacca," I know Chewbacca is a Star Wars minifigure, so I can probably put it in my dataset as Star Wars. That's how we collected data the first time, for five classes, and it worked wonderfully well. There were also a lot of misclassifications, but we expected that, and we tried to clean it up; I think we did a couple of iterations on the keywords and the hashtags we used. Because, if you think about it, "spaceship" could be Star Wars, but it could also be Ninjago, or City; there are a lot of different spaceships. That's the "problem" with LEGO: imagination is the limit, right?
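The hashtag-and-keyword crawl described above amounts to a weak-labelling rule: accept a post into the training set only when its words match exactly one theme's keyword list, and skip ambiguous ones like "spaceship." The keyword lists below are invented for illustration; the real lists were iterated on by the team:

```python
# Hypothetical keyword lists, one set per theme.
THEME_KEYWORDS = {
    "star_wars": {"starwars", "chewbacca", "xwing", "deathstar"},
    "ninjago":   {"ninjago", "spinjitzu"},
    "city":      {"city", "police", "firetruck"},
}

def weak_label(title, hashtags):
    """Assign a theme only when exactly one theme's keywords match;
    ambiguous or unmatched posts are skipped (returns None)."""
    words = set(title.lower().split())
    words |= {h.lower().lstrip("#") for h in hashtags}
    hits = [theme for theme, kws in THEME_KEYWORDS.items() if words & kws]
    return hits[0] if len(hits) == 1 else None

print(weak_label("this is chewbacca", {"#starwars"}))  # star_wars
print(weak_label("my cool spaceship", {"#spaceship"}))  # None
```

Noisy labels from a rule like this still carry enough signal to bootstrap a first classifier, with the misclassifications cleaned up in later keyword iterations, as happened here.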
it's not the real world unfortunately or fortunate it makes it a bit more fun but yeah so then we got the data the first model with five classes worked it was accepted by the the business side and after that happened then we extended to ten no nine classes and then now we're up to 13 classes and i think the latest issues we've been having are not issues but the latest challenges we had to overcome was growing to 15 classes and now we have a three terabyte data set of images which you might think well it's not that much you know there's like images and it has 15 million images i guess it's quite a it's quite a branch when it's the first time we work with such a big data set for a production ready uh solution right and and using sagemaker we also learned a lot of sagemaker because we don't have on on premise gpus so well i have it here but i have it for for gaming on my free time and so we also had to to do some learning on how to utilize sagemaker the best possible and to be fair that also was very helpful when we also got support from them from the enterprise side because then yeah so we were using sketch maker notebooks which is just a full-on jupyter notebook it turns out that even though you cannot uh even though you can choose a gpu instance for the notebook it is not the recommended scenario the the training on the gpu has to be done on the sitemaker training step the notebook is more just for data prep and data visualization well we didn't know that until of course when i we had to run uh these 13 class training actually the 9th class training we did it on the sagemaker notebook and i think it took what did it take a week maybe a week and a half whereas for the 13th class we then got this estimators running on the training jobs and that took i think halfway so more data less time it was a good trade-off i think got it got it um [Music] very cool very cool what are the kind of what's next uh at lego with ml what are you excited about or looking forward to i 
I think that's a very open question, because again, we are a company that makes a lot of, I guess, out-of-the-ordinary things. One of the projects we've been looking at, for example, is a taxonomy: how many cars do we have, how many motorbikes, how many dogs, how many cats do we have with LEGO? And you could say, well, we are LEGO, we own all of our data, we know what we have, and that's true, but who's to say that tomorrow there isn't going to be a unicorn with fish legs, for example? This makes the taxonomy a bit complicated, because how do you evolve an algorithm, or a set of algorithms, that can handle a new class it has never seen before, one that does not look like anything? You can have a fish and a unicorn, but if they are combined, who's the winner here? How do you make sure that your algorithm can learn the new class that's coming? That's something we're looking into, and I think it's hopefully coming next year, but it requires a lot of prep work, that's for sure.

There are also a couple of other things. We're trying to upgrade some of the AFOL experiences, AFOL being "adult fan of LEGO," so all of these adult-oriented experiences. We're trying to put a bit more spice into them, to make it exciting for the adults, saying, oh, you have this specific application here, can it recognize what you're building, things like that. So trying to interact with the users as well. But it's complicated, I think, because you can have very big dreams, but it's always bureaucratic, and it's always about the data. It's like, well, sure, you can ask me to classify all of the red LEGO bricks you have, but I need to know what a LEGO brick is first, right? And what is red?

Very good, very good. Well, Francesc, thanks so much for joining us. It was great learning about your projects, and especially how you've built out the platform and infrastructure to support them at LEGO.

Yeah, and, well, thanks again for having me. And I guess maybe one last learning: you don't need to go Kubernetes, you know, you can always walk. [Laughter]

Awesome, thanks so much.

Yes, thank you.
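The hashtag-based weak labeling described in the interview (crawl posts, map unambiguous hashtags like "chewbacca" to a theme, skip ambiguous ones like "spaceship") could be sketched roughly like this. The keyword lists, class names, and function are illustrative assumptions, not LEGO's actual pipeline:

```python
# Sketch of weak labeling from hashtags. Keywords that imply exactly one
# theme act as labels; keywords shared across themes are ignored, and
# posts whose hashtags imply zero or multiple themes are left unlabeled.

# Hypothetical keyword-to-theme mapping.
THEME_KEYWORDS = {
    "starwars": "Star Wars",
    "chewbacca": "Star Wars",   # a Star Wars minifigure implies the theme
    "ninjago": "Ninjago",
    "city": "City",
}

# Keywords used across many themes ("spaceship" could be Star Wars,
# Ninjago, or City), so they cannot act as labels on their own.
AMBIGUOUS = {"spaceship", "lego", "moc"}

def weak_label(hashtags):
    """Return a theme label if exactly one theme is implied, else None."""
    themes = {THEME_KEYWORDS[h] for h in hashtags
              if h in THEME_KEYWORDS and h not in AMBIGUOUS}
    return themes.pop() if len(themes) == 1 else None
```

For example, `weak_label(["chewbacca", "spaceship"])` yields `"Star Wars"`, while a post tagged only `["spaceship"]` stays unlabeled, matching the iteration on keywords Francesc describes.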
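The SageMaker lesson above (notebooks for data prep and visualization, managed training jobs for GPU training) typically looks something like the following with the SageMaker Python SDK. The script name, role ARN, bucket, instance type, framework version, and hyperparameters here are all placeholders, not LEGO's actual configuration:

```python
# Configuration sketch: the notebook only prepares data, while GPU
# training runs as a separate SageMaker training job via an estimator.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",              # your training script (placeholder)
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",       # GPU instance for the training job
    framework_version="1.8.1",
    py_version="py36",
    hyperparameters={"epochs": 10, "num-classes": 13},
)

# Launches a managed training job; the notebook itself never touches a GPU.
estimator.fit({"train": "s3://example-bucket/images/train/"})
```

The point of the pattern is exactly the trade-off mentioned in the interview: the notebook can stay on a cheap CPU instance, and the GPU is only paid for while the training job runs.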
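One simple, common way to approach the never-seen-before-class problem raised at the end (the unicorn with fish legs) is open-set style rejection: flag low-confidence predictions as "unknown" and route them for review. This sketch uses a softmax confidence threshold; the classes and threshold are chosen purely for illustration and are not LEGO's system:

```python
# Confidence-threshold novelty flagging: predictions where no known class
# is confident enough are treated as a potential new class.
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def classify_or_flag(logits, classes, threshold=0.7):
    """Return a known class, or "unknown" when the model is unsure,
    e.g. a fish/unicorn combination that matches no theme well."""
    probs = softmax(np.asarray(logits, dtype=float))
    i = int(np.argmax(probs))
    return classes[i] if probs[i] >= threshold else "unknown"
```

With a confident prediction, `classify_or_flag([4.0, 0.1, 0.2], ["fish", "unicorn", "dog"])` returns `"fish"`; with nearly flat logits like `[1.0, 1.1, 0.9]` it returns `"unknown"`, which is the signal a team could use to grow the taxonomy with human review.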
Info
Channel: The TWIML AI Podcast with Sam Charrington
Keywords: TWiML & AI, Podcast, Tech, Technology, ML, AI, Machine Learning, Artificial Intelligence, Sam Charrington, data, science, computer science, deep learning, Lego, machine learning infrastructure, LEGO Group, LEGO, human in the loop, AWS, Tesla, MLOps, applied machine learning, Francesc Joan Riera, content moderation, mlflow, Recommendations, Marketing research, Demand forecasting, Predictive sales, serverless, step functions
Id: WM1LpFTfFjk
Length: 44min 41sec (2681 seconds)
Published: Thu Nov 04 2021