Exploring the FastAI Tooling Ecosystem with Hamel Husain - #532

Captions
All right everyone, I am here with Hamel Husain. Hamel is a staff machine learning engineer at GitHub. Hamel, welcome to the TWIML AI Podcast.

Thank you for having me. I'm excited to be here.

I'm excited for this conversation, particularly because we've had a couple of reschedules due to technical issues, so this is a hotly anticipated conversation on my part. It was anticipated prior to all that, but I'm really looking forward to digging into the chat. Let's jump right in and have you share a little bit about what you do at GitHub.

Thanks for asking. At GitHub I've been doing a couple of different things. I spent a long time doing open source work: GitHub sponsored me to work on fastai for a large period of time, and I also did a lot of work with GitHub Actions, integrating those with different open source data science projects like Great Expectations, Jupyter, Kubeflow, and so on. Other than that, I've been working a lot internally on our ML infrastructure. Those are the different flavors of things. Currently I'm on paternity leave, I had a newborn two months ago, so I technically have not been doing anything work-related for a couple of months.

I was going to go down a rat hole and ask you how your sleep was, but...

It's actually surprisingly not so bad compared to my first child.

Oh, that's awesome. I want to start with the fastai work you've been doing and how that came about, but it may be useful context for you to share a little bit about your background and how you came to work on the infrastructure side of machine learning.

I started in data science a long time ago, around 2003 or so, when I graduated from undergrad. Back then I was working as a statistician predicting loan defaults for a large bank. After that I went on to work in management
consulting, doing a data science flavor of things. I had a brief hiatus where I tried out a different career, which we don't have to get into right now, and then I went back to technical consulting. After doing so many data science projects at different companies in different industries, I knew that tooling was really far behind, because people always struggled with deploying models, monitoring them, and doing things in a systematic way. There were a lot of open source tools, but not really good systems. At that time, in 2014 or so, I happened to be living in Boston, and there was a startup called DataRobot which was building ML tooling, specifically AutoML tooling. They had a bunch of people who were really good at Kaggle, three or four different grandmasters, all of whom were number one at some point, and they decided to bake a lot of their modeling best practices into a product. I thought that was really interesting, so I joined them. I learned a lot there about how to create software for data science, and spent a lot of time implementing these systems at different companies that were trying to automate some of their machine learning processes. Then I ended up moving to the Bay Area because my wife was doing medicine and was in a fellowship program there. And I decided I would like to experience one of these Bay Area companies you keep hearing about. From the outside, Silicon Valley looks like a really different and kind of amazing experience, at least when you're living somewhere else and not part of it, so I thought I had to experience this. I joined Airbnb as a data
scientist shortly after coming to the Bay Area, and that was a really interesting experience. When I first got to Airbnb, I thought it would be really advanced in terms of ML tooling and infrastructure. I thought I wouldn't need any tools or infrastructure, that they probably already had those, because it's Silicon Valley. The first project they gave me was a model that forecasts LTV, lifetime value, for growth marketing, and they said, hey, can you review this model? They asked me to take a look at it and see if I could make any improvements, kind of a getting-your-feet-wet first project. It turned out this particular model was a guy who ran an R script on his laptop. It spat out coefficients, he copied and pasted the coefficients into an Excel spreadsheet which had formulas that would materialize those coefficients into a SQL query, and they then copied and pasted that into Airflow. It really blew my mind. I thought, wow, I made it to Silicon Valley, I'm working at a very celebrated tech company, and there's no ML tooling at all whatsoever. You can't really imagine anything more basic than that. So then I started to build tooling at Airbnb, got back into tooling, and created a lot of artifacts that ended up being used for Bighead. After that I went to GitHub, which is where I'm at now, and started working on infrastructure and tooling there, including a lot of open source work on tooling. At this point I've pretty much convinced myself that I will be doing ML tooling for the
foreseeable future, because I keep floating back into it no matter how hard I try to do anything else.

At GitHub, what are the main use cases that ML is being used for nowadays? There's the stuff we hear about, like Copilot, but I imagine most of the use cases GitHub ML engineers and data scientists are working on are internal, and maybe not dissimilar from the kind of thing you did at Airbnb, supporting growth, platform health, that kind of thing.

Yeah, there are a bunch of different use cases at GitHub, some of which you touched on: a lot of forecasting of various things, like infrastructure usage, and a lot of platform health work, detecting spam and abuse of various kinds. Apparently people like to buy stars, so there's catching things like that.

Buying stars, that's a thing?

I didn't know that was a thing either. It seems like a pretty low point in your life if you want to go buy stars. And there's actually some user-facing stuff that isn't so widely known. If you go to github.com/explore, you'll see recommendations of different repositories you might be interested in based on your activity on GitHub. Another example: if you create a new repository, you can attach topics to it, different tags, and there's a small recommendation system that will recommend other tags you should apply. Those are the kinds of things data scientists work on inside GitHub, and it's still a pretty new thing there. It's something they're actually building up right now, in its most nascent stages, I would say.

And with that in mind, what does the platform look like at GitHub?

It's
been changing quite a bit. We were on AWS before, with our own homegrown infrastructure composed of a bunch of different things like Kubeflow, and then we got bought by Microsoft and started transitioning everything over to Azure and Azure ML. Azure ML is the managed service that Azure provides, and it more or less white-labels a lot of MLflow behind the scenes with regards to experiment tracking, the model registry, and things like that. So that's what they're using.

And how did your work with the fastai team and the fastai tools generally come about?

It was really organic. When I first got to GitHub, I think GitHub was still trying to get a sense of what it wanted to do with ML. There was definitely a lot of prototyping and exploration of different ML projects, but it wasn't quite solidified yet. In the meantime, what I did was work in open source, partly for fun, and partly to look for opportunities for things that might integrate with GitHub. I've been a student of Jeremy's from a long time ago, and one of the things I learned in one of his classes was how to do semantic search. He showed something in one of his classes where he had photos of different objects, and you could search those photos semantically using natural language: if you construct a shared vector space, you can create a semantic search. This was back in 2017 or something, when it was a little bit of a newer concept. So I decided to try that
out with code. I thought that would be really interesting, and at that time almost no one had really used GitHub's public dataset. For whatever reason, ML people had ignored the GitHub dataset; maybe it just wasn't obvious that you could get it and do something interesting with it. The thing that's really interesting about GitHub's data is that you can construct a really interesting parallel corpus out of it. For example, from Python code you can get pairs of docstrings and code. It takes some work: you have to get the code, clean it, parse the docstrings out from the code, remove the many duplicates you get, and so on. It's a tremendous amount of work, but you can actually get a very interesting parallel corpus, and then you can do a lot of representation learning on that code and produce some things that are really interesting. So I started working on that in open source. Eventually we got bought by Microsoft, and there were some people at Microsoft Research who were interested in it. We ended up using fastai in that work, and that was one of the things I think Jeremy was really excited about, an example of the library being used in the wild. That was my first involvement. Then at some point GitHub released GitHub Actions, and I started going around to all these open source projects, Jupyter, Great Expectations, Kubeflow, and so on, making integrations between GitHub Actions and these various data science projects. For example, there's a project called Great Expectations, which is testing for data quality, and I thought, okay, it would
be really interesting to do CI/CD for data quality. Say you're updating a SQL query: wouldn't it be cool if changing SQL code in a PR could trigger a test that runs? Things like that. Then I started helping Jeremy out with his CI/CD, and one thing led to another, and I started working on more and more things. Once you start working on someone's CI/CD, you have to understand all their code, how it runs, what the tests are, what they mean, what's breaking. Then I started helping with the documentation.

Going back quite a few years, they've been doing pretty interesting things, like integrating documentation and code and doing that all in notebooks as opposed to traditional code files. I imagine you got sucked into that, and that led to some of the things like nbdev and fastpages.

Yeah, exactly. Fast-forwarding, I went down a big rabbit hole on all the things you described, because I started wondering how all this stuff was created. It wasn't that clear to me; they're doing a lot of interesting things using a very unique tech stack. So I went and learned that tech stack, got really involved in things like nbdev and fastcore, started working on those, and subsequently worked on a bunch of other projects. It was very organic: you see one thing, get interested in another thing.

And keep pulling threads.

Yes, exactly.

So what is nbdev?

nbdev is the development environment that all of fastai is built in. It's a literate programming environment. The idea is that you should be able to write your code, your documentation, and your tests all in a single context, and not have to
write your code in one place, your tests in a different place, and your documentation in yet another place and keep them all in sync yourself. It should be more natural: write some code, explain it to yourself and to your users, and inside the documentation have runnable code that shows by example how the code works. Then make those runnable examples tests as well, where you can, but keep it natural, like a conversation with your users, and have those tests run every single time you want to update your library. nbdev is built on Jupyter, for the interactive aspect of writing code while being able to write prose alongside it. It integrates that with a static site generator that renders documentation, and with tooling that exports notebook code to plain Python modules, the kind you would write in an IDE like VS Code. So it's a system that glues these things together to create a new development environment, a literate programming environment. It's hard to explain in the abstract; I would definitely say it's something to experience, because when someone explained it to me I was really skeptical. I thought, well, this sounds like Jupyter; you just write everything in a notebook. But that's not really what it is. It happens to use a notebook, but that's not the central point. The central point is the experience: yes, in this specific case you're writing in a notebook, but you're writing your tests and documentation in the notebook with some special syntactic sugar available for various options, and then automatically the documentation gets created and tests
will get run in CI for you automatically. The result is much higher quality software and much shorter iteration cycles. Most people don't write documentation and most people don't write tests, and why is that? Because writing documentation sucks: you write your code, and then we ask the developer to go write this other documentation somewhere else and keep it in sync themselves. That's a pain; very few people do it properly, and the same goes for tests. So it's really a new development workflow, and it's one of the secrets behind fastai. You might wonder how Jeremy and maybe one other person, depending on who's on fastai at the time, develop all of that software, and one of the secrets behind that is nbdev, because it helps make you a lot more productive.

Like I mentioned, it does sound quite a lot like Jupyter, or maybe Jupyter with some annotations that let you say: this is docs, this is code, this is a test. But it sounds like it's more of a system than that. You mentioned literate programming a couple of times. What is that, and how does it play into what they're really trying to do with nbdev?

Literate programming is this big concept. I don't remember offhand who invented it, I'd have to look that up, but it's the idea that your programming environment should not be dead, completely static. You should be able to see the result of your code, how inputs and outputs change in real time as you're programming, and be able to experiment on the fly. Also, prose and code should be able to be intermingled naturally, because you want to be able to talk to your users and to yourself and document your code beyond, let's
just say, comments. You want this expository form of programming where you can show your code in the same context as writing it. Yes, it sounds a lot like Jupyter, but it's Jupyter with some other things that let it go the whole way, because Jupyter by itself doesn't let you do software engineering completely the way you might want to. It's not necessarily the best-suited IDE for creating Python modules; you have to export that out of a Jupyter notebook somehow if you're just using Jupyter. It's not straightforward how you would write tests in Jupyter. And while a really polished Jupyter notebook can look a lot like documentation, how do you actually create documentation out of it in a systematic, reliable way, and give users options to control that documentation? Like you said, you might not want to show all the code in the documentation, you might want to show only certain things; how do you do that? It's that friction everyone may have felt: you're in the notebook developing some code, and at a certain point you think, oh my goodness, I need to take this code and make it into plain text, and you refactor it, and you're going back and forth a lot. I think everybody is a little bit frustrated by that process and intuitively wonders whether there can be a better way, but we have all learned to ignore that and say, well, that's just part of life. nbdev is the answer that says no, we shouldn't ignore that; let's try to find a way. By no means is nbdev perfect. It's the best kind of
thing we can have by gluing together existing technologies. I think someone could definitely take the concept of nbdev and build something from first principles that might work even better, but nbdev is the thing that works right now by hacking some stuff together.

From the description, it sounds like yes, it's a tool the fastai team built to allow them to build the fastai library, but it can be used more broadly, by anyone for anything. Is that right?

Yeah, absolutely. It's not just for data science at all. In fact, I feel like it's been used for more regular software projects than for anything related to data science. There have been a lot of API clients written with nbdev, and all kinds of other software. Of course, it only works with Python at the moment, but I really think it's a development environment for general software development.

When I think about some of the activity in this space, I think of things like Netflix's work. In fact, this was probably after your time at Airbnb, but at some point they were going down the path of trying to productionize notebooks, and a lot of people have made attempts at that. Would you say that nbdev is in the same vein? In some ways, what fastai was actually trying to do was write books and courses and things like that, more so than production software. Even the framework, the library, they were authoring it, but not necessarily trying to productionalize it in the traditional sense of the word. Is there a productionalization or operationalization aspect to nbdev at all?

As far as nbdev is concerned, it is
definitely about making proper Python modules and packages, pushing them to PyPI, and making sure you have good documentation and good CI. When you start an nbdev project, it automatically stubs out CI for you, so it's already there, and it very much nudges you toward productionization. Now, as far as fastai is concerned, I think one of your observations is correct: I don't know that fastai has been overly concerned about productionization of applications built with fastai, at least relative to other tools. For example, TensorFlow has TFX and TensorFlow Serving and things like that, and that kind of thing isn't there for fastai. You're right that one of the things that was really important in fastai was having really good documentation, and also good tests, especially given the surface area of the library and only one or two full-time contributors. One of the goals was that the docs have to be really good, because after all this is used for education, and if the docs are not good, people are going to get lost very quickly. But the problem with docs is: how do you keep them in sync with code that is changing so rapidly? fastai changes quite rapidly; they're always keeping up with state-of-the-art things and making their own state-of-the-art things. The answer to that was: let's make documentation a first-class citizen. You should be doing it while you're writing software, it should be really natural, and there should be no friction in introducing documentation. And that's kind
of exactly what nbdev allowed.

And how do fastcore and fastpages relate to nbdev?

Some background on Jeremy: he has programmed in a lot of different languages prior to Python. He sometimes says he's been programming every day since he was, I don't know what age it is...

I was laughing because, it's probably been a couple of years now, but in our TWIML community we recognized very early on the importance of what he was doing with the library and hosted a study group around the course. I remember vividly my early experience with the library, and other people's, and I remember making a comment on our Slack like, why does he name these parameters like this? And someone said, yeah, he was a Perl contributor back in the day. I'm pretty sure I asked him about this in one of my past interviews with him. But yes, he's been programming for quite a long time, and he draws a lot of inspiration from classical computer science, like literate programming from Don Knuth, and a bunch of other folks. He's a big fan of APL.

Yeah.

I took half a semester, or a quarter of a semester, in a survey course in school, and the only thing I remember about APL is that it had the weirdest keyboard. You had all these symbols; you're basically programming in Greek symbols. And Jeremy's super into that.

Yes, and this shows up. He's not exaggerating when he says that, because I've spent a lot of time pair programming with him, and he actually spends a lot of time hacking in different languages even right now. What ends up happening is that whenever he uses a new language, he always
tries to hack the language deeply and creates a bunch of utilities that bring in aspects of different paradigms. For example, one thing he misses in Python is more functional programming tools, or more macro-like metaprogramming abilities, things like that. That's what fastcore is: hacking Python deeply to give you more functionality, or different types of functionality, that you might want, and then he ends up using that everywhere. So if you just read fastai code, a lot of it may look pretty foreign, not only in syntax but in style. Like you said, there are things like succinctness, which he really values: he likes to keep things on one line, the idea being that you can see all the code on one page, or as close to one page as possible. That's where fastcore comes from. It was a deep rabbit hole: to understand fastai deeply I had to understand the development environment, which is nbdev, and then when you start to look at the source code, you have to understand fastcore, and it just goes from there. If you're trying to understand the Python programming language at a deeper level, it's pretty interesting to look at fastcore, to see all the things you can do and get some insight into how they're done. It's a way to learn more Python, even if you're using it every day.

Yeah, I remember having that experience going through the course. I had hacked a
little bit with Python, your typical Python stuff, maybe calling a dunder method or whatever to get a list of methods. But you listen to Jeremy talk about Python and work with it, and he's using all kinds of exotic dunder functions and things you didn't know existed.

It's really interesting, and then you might wonder, okay, why is this even a good idea? Does it actually make this person more productive? What are the costs and benefits? One thing I will say is that a lot of this is in service of learning as well: this whole approach is a journey of continuously learning the Python programming language at a deep level, and he engages with that really eagerly. Any time he's frustrated with anything in Python, he'll stop and ask, okay, can I change it? Whereas someone like me would say, okay, let's just move on, let's just do it. But the thing is, it adds up really fast. He ends up knowing a lot about Python really quickly, even things I would consider very esoteric. I think that adds to the productivity component, but it does have a cost for newcomers: if you want to contribute to the fastai library, you have to learn this other thing.

So the rabbit hole is deep, it sounds like.

Yes, very deep.

So that's fastcore. What about fastpages?

One thing I find really important as a data scientist is writing blogs; it's really useful to share your knowledge. Whenever I've written blogs before, I used to use Medium, and a lot of times you want to put
code in your blogs. But the process of writing is not linear. You start writing, and then you realize you don't like the code anymore, or it doesn't really make the point you were thinking of, so you change the code, and then you have to copy and paste all your code again into this thing, update the words around it, update the output, and it's a big mess. You're copying and pasting constantly, and then you realize: why am I doing this? I'm a programmer; why am I copying and pasting charts and graphs into this thing so I can write a blog post? It doesn't make any sense. If I'm changing things, can I just change all of this with code? And then you also realize: isn't a Jupyter notebook basically a blog? Why can't a Jupyter notebook be a blog? It's pretty obvious. So I started looking around for how to turn a Jupyter notebook into a blog, and there actually was not a good answer. Everything was very hacky; nothing worked very well, just some scattered things here and there. I just wanted something where I save a notebook somewhere and it becomes a blog, with some ability to hide cells and show cells and do some things like that. So I took some of the ideas from nbdev with regards to how it renders docs, and I said, why can't that just be a blogging platform? And can you automate all the conversion with GitHub Actions, with triggers so that when you update something in your repo, it re-renders the notebook, reprocesses it, and makes it into a page? That was the general idea: making it easier for you to
write your blog as a notebook i've always appreciated how weights and biases does that they have a nice implementation of being able to kind of blog your experiments and and things like that is it in a similar vein yeah uh yeah i really like weights and biases too one of my favorite tools um it's kind of it's similar to that i'd say weights and bias is more of a you don't really have to even you don't really have to write any code really if you don't want to you can just start typing like a google doc and you know put some put stuff in there so i think that's really at a lower level yeah i said lower level it's like you make a proper jupiter notebook and you save it um weights and vices the visualization layer is is a bit different um they're not really basically using like python in there it's like some other syntax uh you know and you can use vega and something like that to create custom visualizations and they have you know it's kind of this middle ground yeah so yeah it's it's a bit different but similar kind of kind of a similar genre i guess and then you mentioned uh github actions in well you've mentioned it a few times but you know most recently kind of as a part of the the fast pages process you know talk through github actions and some of the ways you've used them to support data science yeah um so the idea is like can we use like can we have ci cd in various like data science workflows like does it make sense so for example so why are we talking about weights and biases so one integration that i've made is something that will you know ping weights and biases for experiment tracking results and bring those into the pr and render them in the pr so that you can view them and then have a discussion so the idea is like the example that i showed was you know you make a pr against some modeling code i don't know if you've seen prs like this but i've seen a lot of different prs that where someone makes change to a model and then you know what the the review is 
like hey like what happened does it make the model better and the response is yeah it makes it better you just merge it but like you know that is broken like we know that's broken like we can't do that we can't do have this like you know this like uh you know hearsay conversation about code like yeah it should be something that's very objective like more objective and so ideas like okay like can you bring your experiment tracking results into the pr to you know accomplish to like bring more visibility into the results of a workflow and there's a lot of different um nuances there like i don't want to just give you the impression it's just like something is triggered on every push or something like that just like normal code like machine learning is a little bit different so the idea is like can we is there an integration that makes sense and those are the kind of things i worked on uh for example um there's another like there's another thing where um there's a project called repo to docker that takes like that takes a any repository like data science repository and dockerizes it and this is for the purpose like this is what binder uses like if you try to go to binder and like give it a github url it will do is like it'll give you a jupyter notebook but it'll try its best to like build the dependencies by introspecting your repo so you don't need the docker file there you're just you know giving it a typical python repo maybe it has a requirements txt file and it's going to figure it all out yeah it'll try to guess like it'll you know has a hierarchy like it'll look for requirements.txt file if it doesn't find that you know like look for a condom file it doesn't find that it'll like look for something else and look for like these things and like you know it also supports docker files or you know if you have that it'll look for that or do whatever but you know a lot of people don't have that stuff like a lot of data scientists just have requirements txt or something 
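The dependency-guessing hierarchy described here can be sketched roughly as follows. This is a simplified illustration of the idea, not repo2docker's actual code; the file names and precedence order are assumptions based on the conversation, and the real tool supports many more formats and orders them differently.

```python
from pathlib import Path

# Sketch of a repo2docker-style detection hierarchy: check for
# known dependency files in order and pick a build strategy.
# Illustrative only; real repo2docker handles many more formats.
DETECTION_ORDER = [
    ("requirements.txt", "pip install the listed packages into a base image"),
    ("environment.yml", "build a conda environment from the file"),
    ("Dockerfile", "build the image directly from the user's Dockerfile"),
]

def detect_build_strategy(repo_dir):
    """Return (matched file, build strategy) for a repository checkout."""
    repo = Path(repo_dir)
    for filename, strategy in DETECTION_ORDER:
        if (repo / filename).exists():
            return filename, strategy
    # Nothing recognized: fall back to a plain default image
    return None, "use a default base image with no extra dependencies"
```

For example, a checkout containing only a requirements.txt would resolve to the pip strategy, while an empty repository falls through to the default base image.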
And you want a reproducible environment. So: can you have GitHub Actions automatically build that image for you and deploy it somewhere? That's being used. GitHub Actions are interesting because you can pre-package them and make them modular. Take the Weights & Biases integration: you can just call the Weights & Biases action and give it three or four parameters to get it working; you don't have to touch all the code I wrote to ping the Weights & Biases API. Similarly, for the Jupyter example, you just give it whatever parameters it needs, say your Docker credentials if you're pushing the automatically built image to a Docker registry, and off you go. That's the power of GitHub Actions: you don't have to worry about the complexity of these things, you can just use them in your workflow.

So, at a high level: you've got various lifecycle triggers on the GitHub side, whether that's code being pushed or a new comment in an issue, and then the action itself encapsulates integrations with other things, so you can hook all your other tools into various stages of interacting with GitHub?

Yeah, absolutely. That's a good explanation.

And then, of the ones you mentioned, did you mention actions for Great Expectations?

Yeah, I did. So let's say you have a SQL file in your repo. You can set up an action that triggers when you change the SQL file, and then you want to validate the data emitted by that SQL, or maybe it's even a table definition you have in your GitHub repo. You can have that validated by Great Expectations, and then you can have the action tell you whether it passed or failed the expectations check. If it fails, you can have it place a link to the dashboard that Great Expectations produces, and things like that. It just makes things a little bit easier for you as a data scientist, having it more integrated where it makes sense.

And maybe to wrap things up: do you have visibility into the direction that Jeremy and the team are headed with fast.ai? I'm wondering what you're excited about there, or what you're looking forward to taking on once you're back from paternity leave.

Oh yeah, that's a good question. I've tried to stay true to my paternity leave and not pay too much attention to the outside world. It's been hard, but I actually don't know, to be honest with you, what they're going to focus on the most; I'd have to talk to Jeremy to ask him about that.

What was on your list of things you were looking forward to and excited about before you left, do you remember?

Yeah. One thing I've been really interested in is the explosion of different ML tooling out there and how this space is going to evolve. Also, I haven't mentioned that GitHub has moved onto Azure and is using Azure ML, and I was looking at all these alternative workflow tools for ML. I was exploring a lot of them, and the one I've been most excited about so far is Metaflow, from the Netflix team. That's one of the things we've been playing with most recently, so we'll see; maybe I'll do something with that project in the future.

Nice. I recently interviewed Ville Tuulos, who is the founder of that project; he's no longer at Netflix, he's doing a startup based on Metaflow. So I'm sure there'll be lots of opportunity to dig into what they're up to there.

Yeah, there definitely is, and I've been talking to Ville as well with great interest. I'm really excited about that project.

And what in particular, out of curiosity? You mentioned all the projects that are out there, and there are a ton of them. What about that one in particular catches your interest?

So, a lot of projects are out there. Before this I was actually involved in Kubeflow, and I've also used MLflow because of Azure and things like that. One of the things that really impresses me about Metaflow is how they meet the user where they are. What I mean by that is, for example, Kubeflow kind of tells the data scientist: hey, you should learn Kubernetes. They may not say that explicitly, but it's definitely lurking in the room, and ultimately that can make adoption very difficult. It's like saying, hey, I want to drive this car, but I need a mechanic sitting in the passenger seat next to me the whole time. No, I can't drive a car like that; can I just drive the car? A lot of the Netflix infrastructure you see is different in that respect; they try to meet the user where they are. Some might argue they meet the user too much where they are, maybe with the notebook infra, but I actually think that's really cool; maybe some of those things are experiments to see how far they can push the envelope. Either way, I really like the whole notion of meeting the user where they are and thinking about what the user experience looks like. The design of their API is very intuitive, and it doesn't require the user to learn some galaxy-brain thing like Kubernetes. That sounds really simple, but I don't see that many other tools doing it. MLflow is there too, but I feel MLflow is behind Metaflow in a lot of its features in this regard, and I don't find its API as intuitive. So that's it: it's an open source project that seems to have good traction, I really like their API, I like their philosophy of meeting the user where they are, and that sets them apart. I'm excited about it.

Awesome, awesome. Well, Hamel, it's been great catching up with you. I'm excited we finally got to record this conversation, and thanks for joining the show.

All right, thank you for having me. Thank you.
Info
Channel: The TWIML AI Podcast with Sam Charrington
Views: 208
Keywords: TWiML & AI, Podcast, Tech, Technology, ML, AI, Machine Learning, Artificial Intelligence, Sam Charrington, data, science, computer science, deep learning, machine learning infrastructure, machine learning tools, github, github actions, hamel husain, fast.ai, jeremy howard, nbdev, fastcore, fast pages, airbnb, bighead, mlops, ml platforms, open source, azure, microsoft, datarobot
Id: H5CiZZHPipg
Length: 49min 28sec (2968 seconds)
Published: Mon Nov 01 2021