Jupyter and the Evolution of ML Tooling with Brian Granger - #544

Captions
Sam Charrington: All right everyone, I'm here with Brian Granger. Brian is a Senior Principal Technologist with Amazon Web Services. Brian, welcome to the TWIML AI Podcast.

Brian Granger: Hi Sam, thanks so much for having me. Looking forward to jumping into our discussion.

Sam: You are a co-founder of Project Jupyter, and that is of course a topic we'll be digging into in this conversation. But to get us started, I'd love to have you share a bit about your background and how you came to work in machine learning, and we'll make our way to the founding of Jupyter as well.

Brian: I definitely can walk through that. As you mentioned, I'm a Senior Principal Technologist at AWS, and I've been here at AWS for three years, coming up in February. Before that, for the decade and a half prior, I was a physics professor, most recently at Cal Poly San Luis Obispo and before that at Santa Clara University. Even though I was a physics professor, most of my time at the university I spent building open source tools for data science, machine learning, and scientific computing. So I have a background in theoretical physics, but that's evolved over time through software engineering and building tools, and more recently I've spent a lot of time on UX design and research.

Sam: Nice. And so how did Jupyter come to be?

Brian: It's a fun story. If you rewind back to the early 2000s, Linux was really taking off, and Python had been around for a few years but was becoming visible in the scientific computing community. A classmate of mine in grad school at CU Boulder, Fernando Pérez, had started to use Python in his research, and he's really the one who introduced me to Python. Both he and I had used Mathematica a lot during our physics education, and even though we were doing computational physics in other languages, we had always missed the notebook interface that Mathematica had. In 2001, Fernando released IPython, an improved and enhanced command-line REPL for Python that had some of the ideas from Mathematica in it, although it didn't have the full notebook interface.

In those early years I started to play with Python a bit while Fernando worked on IPython, and then in 2004 he visited me in the Bay Area; I was a young professor at Santa Clara University. While he was there, we spent a lot of time talking about computing, what we were doing in our research, and how we were using these tools, and it was really then that the vision of creating a web-based notebook for Python came into focus. There were a couple of different factors in that. One, it was something we wanted to use in our own research: we had found over the years that interactive computing in a notebook-based interface, where you also have a document, was extremely useful, and we just wanted it to exist. The other was that by 2004, rich web applications were starting to appear, so it made a lot of sense to us that if we were going to build something like this, it should be entirely web-based.

Now, that was 2004; it took us until 2011 to release the first version of the IPython notebook. Some of that was us learning about this space: we were theoretical and computational physicists, not web developers. We wrote a lot of code as physicists, but of a very different nature than this. The other factor is that modern web technologies, even in 2011, were relatively primitive, and from 2004 to 2011 we were really waiting for modern web technology to catch up to what we needed. Even in 2011, the state of the art was jQuery and Bootstrap, and WebSockets had just been turned on in all the browsers. We were using all the latest stuff in 2011, which now seems rather primitive compared to what we have today.

Sam: When we think about Jupyter or talk about Jupyter today, it's often in the context of ideas like literate programming. Were you thinking about it from the perspective of literate programming and some of the theoretical foundations of why a tool like this makes sense, or was it strictly scratching your own itch, trying to bring to Python this interface that you loved in Mathematica?

Brian: The connection to traditional literate programming came much, much later. There are two phrases we use: one is literate computing, rather than literate programming, and the other marks a distinction about interactivity. In the traditional literate programming paradigm there's nothing interactive about it. You're not actually, as a human, interacting with live code as you would in a REPL; you just write a source file and then use the literate programming tool to compile it into the actual source code that can run. In scientific computing, data science, and machine learning, what matters is the interactive experience of writing and running a bit of code, seeing what the output is, and having a stateful process that holds the state of the program in memory, which you can then write more code against. That's what we've always thought of as literate computing, or interactive computing. Another phrase we talk about a lot is the idea of a computational narrative. Again, the focus there is on a narrative contained in a document, but it's not about mere programming, as in typing code; it's about actual computing: writing code and running it, seeing the result, and using that to think about data, in the case of machine learning.
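The "stateful process that holds the state of the program in memory" Brian describes is essentially what a Jupyter kernel provides. A minimal sketch of the idea, using only the Python standard library rather than Jupyter's actual kernel machinery (which runs the interpreter in a separate process and talks to it over a messaging protocol):

```python
# A long-lived interpreter holds program state in memory; each new
# snippet of code runs against the state accumulated so far, which is
# what makes "write a bit of code, see the output, write more code
# against it" possible. Illustration only, not Jupyter's real protocol.
import code

interp = code.InteractiveInterpreter()

# "Cell" 1: create some state.
interp.runsource("data = [1, 2, 3, 4]")

# "Cell" 2, run later, still sees the state from "cell" 1.
interp.runsource("print(sum(data) / len(data))")  # prints 2.5
```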
Sam: So I think you got us through 2011, 2012, with all these new web tools catching up to what you needed. Around that time we also saw an explosion in machine learning and deep learning interest. How did that shift what was happening with the Jupyter project?

Brian: In the early years, when we thought about what success would look like for this web-based notebook for Python, our market, if you want to phrase it that way, would have been academic researchers doing scientific computing. That's the universe we were in at the time; those were all the people we were talking to. All through the early 2000s, whenever we talked to people in industry, they looked at interactive computing with a bit of an "oh, that's nice, but we'd never need to do that." I think Fernando even had a conversation with Guido van Rossum, the creator of Python, and when Fernando described how we use Python interactively, Guido said, "Wow, I always thought of the Python REPL as a bit of a toy that no one would actually use for real work. It's amazing to see how, in the scientific computing context, you live in these interactive shells." So we had always thought that commercial adoption of these tools would be very slow at best. Instead it came in from the side: around the same time, commercial entities discovered the power of data through data science and machine learning in a way they hadn't before, and they ended up needing these same tools for interactive computing.
They quickly discovered Jupyter as one of the tools they could use for this. I remember that the period starting in 2011 and running up to maybe 2015 was an amazing time for us, in that it felt like every month a new major organization discovered interactive computing and Jupyter and Python. By the time we got to 2015 or 2016, a lot of people and a lot of companies were using these tools. It was a transition, and a kind of growth, that we had not seen coming and definitely hadn't planned on, and it created a lot of challenges for Jupyter as a project.

Sam: Was it obvious that you should embrace these new use cases and new user communities, or was there a bit of tension between staying the course, building the thing that your scientific computing users needed, versus the things machine learning users might need, to the extent those are divergent in any way?

Brian: I don't want to imply that there's never been tension there. I don't want to speak for all Jupyter contributors, but I can try to summarize some of the sentiments in the community. Today Jupyter is used by many large corporations, and many contributors to Jupyter today work at those corporations. Back in 2011 it was close to 100% academics working on Project Jupyter, and even today I think the core Jupyter team deeply values the role of Jupyter in research and education. We recognize it's important in the commercial space, and we definitely want to address those usage cases, but broadly speaking, the Jupyter community holds the research and educational usages with a sort of special importance. So there has been some tension there.

The other question you brought up is the potential for divergence between the needs of academic and commercial users. To first order, the experience we've had is that there's no substantial difference in the needs of those communities. It's even a little humorous: still to this day, we regularly talk to organizations that start out by saying, "You know, the way we're using Jupyter is really weird and special, probably unlike anything you've ever heard," and then they tell us a story we've heard hundreds of times, about doing all the same things everyone on the planet is doing with Jupyter. That's not to say there aren't differences. For example, in the academic context, the requirements of teaching are really unique: when you're teaching a large class in a university with notebooks, you need to be able to assign homework that involves notebooks, manage it, and grade it, and we've built special capabilities for things like that. But at the core, I would say most of the functionality is common across all Jupyter users.

Sam: Got it. And so you're now at AWS, and you continue to work on Jupyter. How did that come to be?

Brian: As I mentioned, by 2015 or 2016 commercial and enterprise usage of Jupyter had really taken off, which meant the Jupyter developers were talking to a lot of users who were no longer individuals but people maintaining and operating large-scale Jupyter deployments in their enterprises. The pain points and struggles they were having were sometimes related to Jupyter, and would be bugs or feature requests or enhancements we could make on the Jupyter side. In other cases, the challenges they were running into really weren't a Jupyter problem; they were deploying-and-maintaining-cloud-infrastructure kinds of problems. A great example would be entities that have some manner of private or sensitive data and need to deploy Jupyter to meet a certain compliance regime, such as HIPAA. That's not really a problem that Jupyter as an open source project is going to solve in an end-to-end way; Jupyter may offer building blocks that can be used to assemble such a system. So that was one dimension for me personally: starting to talk to these enterprise users and realizing there's a huge need here that the Jupyter open source community is probably not going to fully solve, and that I'd love to work in an organization that's really good at all those enterprise cloud computing challenges.

The other dimension was thinking about long-term sustainability. As a result of the adoption in the enterprise and commercial space, the user base of Jupyter grew far, far faster than the size of the development team. The Jupyter user base has been growing exponentially, with a doubling period of about a year, since I think 2015. The Jupyter contributor population has not grown exponentially, even though it has grown a lot. So we've really faced a resourcing issue, where as an open source project we just have not been able to keep up.

Sam: I love, by the way, that you said it was growing exponentially and actually meant exponentially.

Brian: I can show you the plot where I'm getting this from. There are a number of different ways we measure how big the Jupyter user base is, but one is the number of public notebooks on GitHub, which we periodically gather. We have that chart in a repository, in a notebook, and it shows the growth of public notebooks on GitHub; it's been growing exponentially since somewhere around 2015.

Part of this is that the Jupyter governance model has always been multi-stakeholder. Some of that came out of our own needs: Fernando was at Berkeley, I was at Cal Poly, so almost by definition there were multiple stakeholders and organizations involved. As additional academic contributors came on board, and then eventually contributors from companies, we embraced that multi-stakeholder nature. From this perspective, my moving to Amazon was an opportunity to bring a new stakeholder into the mix of improving Jupyter and keeping it a sustainable, growing open source project.

Sam: And on the governance model: Jupyter is part of the NumFOCUS foundation, is that correct?

Brian: NumFOCUS is a 501(c)(3) nonprofit that's the umbrella organization for a number of open source projects in this space, including NumPy, SciPy, pandas, SymPy, Jupyter, and dozens of others; there are more than I can possibly name. Jupyter is one of those. Within NumFOCUS, each project has its own governance model; it's not like the Apache Foundation, where there's a single governance model that everyone under the foundation adopts. On the Jupyter side, we've actually been refactoring and designing a new governance model over the last two years to address the scope and scale of Jupyter. We've been rolling that out incrementally over the last year and are still doing work to finish it, but the core idea is that it is multi-stakeholder, and we're trying to build checks and balances that encourage cooperation and a vibrant community among all those stakeholders.

Sam: You've talked about what's in it for Jupyter in finding an enterprise cloud home, if you will. What's in it for AWS? How and why does AWS invest in Jupyter?

Brian: This is a great question. The story of Jupyter at AWS actually began before I joined. As AWS was diving into machine learning and data science, the question came up: our customers are asking us about notebook platforms, what are we going to offer? A decision was made, before I joined, to embrace Jupyter, and it really came from customer feedback. The leadership team in the org I'm in, the AI/ML org at AWS, spent a lot of time talking to customers, understanding what they're doing, what their pain points are, and what existing open source technologies they're using, and even before I joined we heard a resounding chorus that people were using Jupyter, that they needed help deploying Jupyter in a secure, cost-effective way, and that they wanted actual Jupyter: not a notebook-like solution, but real Jupyter. This is also something that made it possible, and very attractive, for me to join AWS. I didn't need to start at the very beginning and make a case for why we should ship Jupyter rather than build our own notebook; that was already done and settled. Since I joined, it's been more a question of how we at AWS make sure Jupyter continues to be the best notebook platform and a vibrant, growing open source community.

Sam: At AWS, across different projects, there are different approaches to engaging with open source communities. It sounds like the way AWS is incorporating Jupyter into its products is to stick close to core Jupyter, as opposed to forking it off into something else.

Brian: Absolutely, yep. This gets back to a technical and architectural principle that Jupyter has used since its founding: at the end of the day, Jupyter builds Lego pieces for notebook platforms, for interactive computing platforms, and the idea is that enterprises and organizations can take those building blocks and assemble them in different ways. At AWS, that's exactly what we're doing: taking the open source building blocks and assembling them in a particular way to serve the needs of our customers. We may, and in fact do, build additional things on top using the various extensibility APIs Jupyter has. For example, our machine learning IDE at AWS is SageMaker Studio. It's based on JupyterLab, but on top of that, the SageMaker team has written a bunch of JupyterLab extensions that add machine-learning-specific capabilities to make it an end-to-end solution for machine learning. Those extensions wouldn't make sense in Jupyter from an open source perspective; a lot of them are specific to AWS.
Jupyter as an open source project works really hard to maintain an essentially vendor-neutral perspective, so if you look across Jupyter's different code bases, you're not going to find a lot of code specifically tuned to a particular cloud platform or deployment context. At Amazon we're extending Jupyter and building additional domain-specific capabilities on top of it, but any time we're looking at bugs or enhancements to the core of Jupyter itself, our approach is to work with the Jupyter open source community and contribute those changes back. We have a dedicated team of engineers, a small team, but they're 100% focused on upstream contributions to Jupyter, with the goal of making sure Jupyter continues to be the best notebook platform out there. And again, it's not just about the software but also about the open source community, so we're participating in that community in a way that we hope keeps it sustainable, growing, inclusive, and diverse.

Sam: Zooming out a bit, I'm curious how you think about the broader ML tooling space. Jupyter is just one piece of what a data scientist or machine learning engineer might interact with to get an idea for a model, and to take that idea into production. How do you think about that broader space?

Brian: For this I'll start with the landscape of products we have at AWS for machine learning under the SageMaker umbrella. What we're seeing from customers is that there's a bunch of different tools and capabilities they need to go all the way from the beginning of the machine learning workflow, where they're importing and preparing data, through building models, then deploying them, then using them to make predictions, whether that's in a product through an API, or predictions consumed by humans in something like a dashboard. We've been building the different parts of that workflow at AWS, in SageMaker, in close collaboration with customers. One of the AWS leadership principles is customer obsession, which means we spend a lot of time talking to customers and understanding what they're doing and what their needs are, so all the different things we're building in SageMaker are a response to what our customers need.

With that said, I think the challenge emerging more broadly is one of complexity. If you look at all the tools a large organization has to string together to span the complete end-to-end machine learning workflow, and to have those tools address the different personas participating in machine learning, whether that's data engineers, data scientists, ML scientists, MLOps engineers, et cetera, there's just incredible complexity, along a number of different dimensions. There's fundamental complexity in the data people are working with, complexity in the algorithms, workflow complexity, and then there's the reality that this nice picture we have of the machine learning workflow, starting from importing data and moving through exploratory data analysis, data preparation, model training, model evaluation, and deployment, is never linear in practice. Someone starts working with a data set, and the initial questions they ask are very basic: what's even in this data? What might we predict? What business questions might we answer with it? They may get to the end of that initial pass and discover they're not even close to being ready to build a model they can deploy and make predictions against. They have to go back to the beginning, clean up the data, gather more data, join it with other data they don't yet have available. Then they come back and do it again, and maybe this time they get a little further and start to feel, okay, we may be able to predict this, let's dive in and see how far we can push the quality of the model. After that, they may have to look at questions around bias and explainability and ask: we have a model that's performing well, but can we use it responsibly and ethically? And they may have to do another cycle through all of this. That iterative nature, and the complexity of the overall workflow, is something that we at AWS, and everyone across this entire industry, are just starting to grapple with. I'm hoping that ten years from now we'll look back on this stage and say, wow, we've made incredible progress, really on the side of UX design and human-computer interaction: that our systems will evolve to the point where they still have these capabilities but allow the humans using the tools a much simpler experience. I think that's the main challenge we have right now.

Sam: Digging into the user experience and HCI aspects of this, I know that's something you're very passionate about and spend a lot of time researching. What have you learned and applied to Jupyter, or what has changed the way you think about Jupyter, and what from those spaces do you think will shape the way you build and guide these tools in the future?

Brian: There are a number of different dimensions here; I'll pick two to talk about briefly. One is that organizations doing machine learning and building machine learning tools are pretty much guaranteed, almost by definition, to be engineering-heavy. If you look at these organizations, you're not going to find a lot of UX designers who just naturally fit into the process. So the first challenge is that the sheer weight of engineering required to build and use these systems is massive. Even in organizations like SageMaker, in the AWS AI/ML org, we have been very deliberate about building out UX design teams, and yet across our organization we're still very engineering-heavy, because the problem requires it. That weight and momentum of engineering presents a lot of challenges to prioritizing the human experience of these tools. A lot of what I'm working on right now at AWS is building mechanisms to help us include consideration of the human experience, and it's a lot of fun to be diving into, but certainly very challenging. I think the key is that, at least at AWS, it's rather new to build tools where the human experience is so primary and important; it's a growth area for us and something we're spending a lot of time, energy, and investment to improve.

The other dimension is that there are very few situations where we as humans have tried to design tools this technically complex. I'll use an analogy: I'm a car nerd in addition to being a data nerd. Think about how people approach car design. If you're designing a hatchback for a mass-produced market, you can have UX designers come in and look at the human needs, and those designers won't need to know much about the technical implementation of the car; they can work with engineers who handle all that, and it will work wonderfully. If, on the other hand, your job is to design a Formula 1 racing car, anyone involved in that product has to have an incredibly high level of technical knowledge. Say you're the UX designer designing the steering wheel for an F1 car. You need to understand all the technical capabilities the driver needs at their fingertips, and the human-factors principles that let a driver manage dozens of buttons while driving 200 miles an hour. How on earth do they do that? They have to glance down in a split second to find the button that changes the brake bias or how the engine is tuned, and the designer has to become an expert in the technical details of that platform: tire wear, brake bias, when the drivers need to use these things. This is, I think, another fundamental challenge: the designers helping to design these tools need, over time, to gain the technical expertise to understand these technical users and what they're doing with the code, the data, and the tools we're building.

Sam: Talking about the first of those directions for incorporating user experience into product, the recent Canvas announcement came to mind. Can you talk a little about the way user experience design went into that product?

Brian: Absolutely. At re:Invent this year we launched Amazon SageMaker Canvas, a tool that enables business analysts to train machine learning models. These are users who spend a lot of time working with tabular data sets, focused on answering significant business questions with them: maybe Excel spreadsheets, or data in relational databases that they run SQL queries against. What we're hearing from customers is that these business analysts often work with data scientists or machine learning practitioners who can build models, but there are never enough data scientists and ML practitioners to support them. So the vision of Canvas is basically to let one of these analysts import a tabular data set and pick a target column; we suggest what type of prediction is relevant, whether that's classification or regression, then train a model and enable the business analyst to quickly make predictions. It's a no-code interface. What's exciting is that it uses the same underlying platform as SageMaker: the models trained in Canvas use SageMaker Autopilot, which is our AutoML service. So when an analyst trains a model, they can hand it off to a data scientist, who can do additional work on it as needed. For example, if Autopilot has suggested multiple model candidates, a data scientist can come in and help the analyst figure out, for this business use case, what the best possible model is at that point.
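A hedged sketch of the hand-off Brian describes: Canvas itself is no-code, but the models behind it come from SageMaker Autopilot, which can also be driven directly through the boto3 CreateAutoMLJob API. The job name, bucket, role, and column names below are placeholders, not anything from the conversation:

```python
# Launch an Autopilot job on a tabular data set: pick a target column
# and let the service explore candidate models, roughly what Canvas
# automates behind its no-code UI.
import boto3

sm = boto3.client("sagemaker")
sm.create_auto_ml_job(
    AutoMLJobName="canvas-style-tabular-job",          # hypothetical name
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-bucket/churn.csv",  # placeholder data
        }},
        "TargetAttributeName": "churned",              # the "target column"
    }],
    OutputDataConfig={"S3OutputPath": "s3://example-bucket/autopilot/"},
    ProblemType="BinaryClassification",                # or let Autopilot infer it
    AutoMLJobObjective={"MetricName": "F1"},
    RoleArn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
```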
Brian: What we're seeing is that Canvas enables these analyst users to focus on the business questions they want to answer, and on understanding what kinds of things they can bridge using machine learning.

Sam: Even the name of the product elicits this visual approach to building machine learning. Do you see that as extending beyond what Canvas is today, which is, frankly, a very simple approach to solving relatively simple problems?

Brian: The question I would come back to is: for these analysts, what are they doing on a daily basis where machine learning could help them, and how do we make them successful in doing that? Today, Canvas does have some data preparation capabilities, but it's not as sophisticated as, for example, what data scientists would do in a notebook, or in a tool like SageMaker Data Wrangler, which is a low-code data preparation tool we have in SageMaker Studio. So we have a question in Canvas right now, or I guess it's more of a hypothesis, that the analyst personas don't need to do heavy-duty data preparation. Now that we've launched the product, we're going to get to figure out how much data preparation they need: do they want to do it themselves, or do they want to be assisted by data scientists? At this point our hypothesis is that they don't need heavy-duty data preparation, and this is not just a wild guess; we've spent a lot of time talking to customers who have analysts who would be using a tool like this, and that's our sense right now. Part of what you're asking is where Canvas might evolve over time, and I think that's one question we have. Another is the role of collaboration between business analysts and data scientists. We have collaboration capabilities built into Canvas and SageMaker Studio to enable that to happen. Our hypothesis is that these users do need to work together; time will tell us more about the nature of that collaboration and what additional things customers need.

Sam: I'd love to spend a bit more time on collaboration and the way you see it taking place in the context of the machine learning workflow in general, and in notebooks in particular. It's easy to look at notebooks as taking you from an IDE or terminal that's landlocked to your computer, to a web page that could be anywhere and offers the possibility of collaboration. It strikes me that while that's a natural idea for notebooks, it's under-implemented; I don't see it being used that way as often as I might expect. It's like the promise of a Google Doc, but everyone just uses it as a regular word processor. I'm wondering what you observe about collaboration in the ML process in general, and the way you see that applying to tools and notebooks in particular.

Brian: When we released the IPython notebook in 2011, users started to work with it and began opening issues on GitHub to give us feedback, and within a very short period of time, I think a month or two, one of the earliest feature requests was for real-time collaboration, similar to what you have in something like Google Docs. It has continued to be probably the most significant feature request from the Jupyter community. So we heard from the Jupyter user base very early on that they wanted real-time collaboration: they looked at these notebook documents the way they look at documents in a word processor, and wanted to collaborate in that mode of interaction. On the Jupyter side we spent many years working on this, and we've had a number of false starts. It's compounded by the fact that building the needed infrastructure and architecture for real-time collaboration is, from an algorithmic perspective, quite complex. Thankfully, the underlying algorithms have improved over the years, and just this year, in JupyterLab 3, we've launched the first support for real-time collaboration. We're using another open source library that's been fantastic for this, called Yjs; it offers a very performant CRDT implementation in JavaScript, and that's really what has enabled us to build real-time collaboration in JupyterLab 3. Any user who downloads the latest version of JupyterLab 3 can pass a special flag on the command line that enables the collaboration feature. With that said, we're just getting started in terms of the full experience; there are a lot of additional user experience dimensions we need to add, and other technical dimensions, but it continues to be a major focus of the Jupyter community and something users have wanted since the very beginning.
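Brian doesn't name the flag; in JupyterLab 3.1 and later it is the `--collaborative` option (an assumption based on the JupyterLab documentation, not on the conversation):

```python
# Shown as comments because these are shell commands, not Python:
#
#   pip install "jupyterlab>=3.1"
#   jupyter lab --collaborative
#
# Every browser pointed at the same server then edits one shared
# document, synchronized through the Yjs CRDT, instead of each client
# holding its own in-memory copy of the notebook.
```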
Brian: Now, the broader picture of collaboration in machine learning that you mentioned: what we see, both at AWS and in Jupyter, is that there are many different personas participating in the overall machine learning workflow, and the key points of collaboration are between those personas. For example, a data engineer who's been preparing the data hands off a data set to a data scientist to work on. That mode of collaboration between personas is really one of the main challenges we see, both in Project Jupyter and in the SageMaker products on the AWS side, and it's very different from environments where collaboration happens primarily within the same persona. That's not to say the same-persona pattern never happens in data science and machine learning, but the more challenging one is cross-persona collaboration.

Sam: I could see arguments for that making things easier, in that you have these well-defined interfaces between the personas, or at least an opportunity to define an interface between them, whereas if you have people with the same role working on the same thing, it's easier for them to step on one another's work, so to speak. But defining those interfaces can also be challenging.

Brian: Absolutely, it really is. There's the interface from the perspective of a programmatic API, and also from the perspective of a graphical application. And you quickly get into the challenges of distributed and shared data structures: some personas tend to work with entities that are immutable, others with entities that are mutable, and you have to figure out how to get those personas to collaborate when the underlying entities they're dealing with are fundamentally different. For example, software engineers are completely familiar with collaborating using Git and version control, but other stakeholders who want to interact with the notebooks the data scientists are working on are not going to be using Git. They probably want a graphical interface that lets them comment on a notebook the same way you comment on a word-processing document. They're not going to be on GitHub submitting pull requests and using the git command line, and yet it's still the same entity underneath; it's still a notebook at the end of the day. So figuring out what the entities and data structures underneath cross-role collaboration even are is, I think, a really major challenge.

Sam: An area I wanted to talk through with you is the role of the notebook overall. This may be circling back to the very beginning of the conversation, but it's a conversation happening fairly broadly in our community: are notebooks the right tool for machine learning? Notebooks versus IDEs? And with Canvas in the mix, maybe the question is no-code versus notebooks versus IDEs. How do you react to those kinds of questions?

Brian: This is a great one. First I'll tackle the question of when you should use a notebook versus an IDE, or whether the notebook is a substitute for the IDE. The question to ask is: what is the fundamental activity or task you're performing? In the case of an IDE, typically the task is that you're building something: software, a service, an API, a software product. So the fundamental verb of an IDE, I would say, is build, with secondary verbs like test, deploy, and debug alongside it. Whereas if you look at a notebook, what it was built for and how people use it, I would say that build is probably not even secondary. The notebook is a tool for thinking with code and data. When a user is working with a notebook, at the end of the day they're trying to work in parallel with the computer to understand what's in the data, what they might predict, and what the meaning of that prediction is. Is there causation? Is there bias? Can they use this to explain the result? Do they trust the prediction the model is making? Can they use it ethically? These are all human questions, and so the notebook really is a tool for thinking. From this perspective there's not really any confusion between an IDE and a notebook: they're two different tools used for two very different tasks, in the same way that an SUV and a two-seater sports car are two very different vehicles used for different purposes. If you take an SUV and try to get the sports car experience out of it, it's going to be pretty disappointing, and vice versa. So when I hear people complaining that Jupyter is not a very good IDE, the filter I read that through is someone saying an SUV is not a very good sports car: yeah, it wasn't designed to be. Jupyter was not designed to be an IDE in the sense of building, deploying, and debugging software products.

With that said, there is a gray zone, where users start working in a notebook, interactively thinking about code and data, and at some point the project gets mature enough that people start to build things. That transition is still really painful, and it's painful whether you try to keep working in Jupyter or you move from Jupyter notebooks over to a traditional IDE. I think that's a major area of innovation, with a lot of potential for the Jupyter open source community and others to dive into and figure out: what does this transition from thinking with code and data to building software products look like, and how do you work in that in-between zone?

Sam: I was thinking there are a number of efforts taking different approaches to try to productionize the notebook; I'm sure you've seen these as well. Up until the very end of your comment, I would have thought you considered those all misguided, but it sounds like, rather, they're attempts to figure out this confusing thing that we don't really know what it needs to look like just yet.

Brian: There are a number of efforts in the Jupyter open source community and other open source projects around taking notebooks and using them in a more production-oriented way. One is a Jupyter subproject called Voilà. It allows users to tag cells in a notebook and then turn those cells into an interactive dashboard that looks like a web application, not a notebook, and deploy it to users. That would be a more human-oriented deployment of a notebook, for a group of users who never want to look at the code but who want to interact with its outputs. Another example, as you brought up, is the idea of scheduling notebooks. At some level you can write a notebook the way you write a Python function: a notebook can be parametrized by a set of arguments, and then you might want to run it for different sets of arguments on some schedule. Both of those usage cases are things we're seeing; different Jupyter users and AWS customers are doing things like that. I think the more challenging cases are where you want to build a machine learning model initially in a notebook but eventually transition to building that model as part of a broader pipeline that leads to deploying it behind an endpoint. That transition between a notebook and the more traditional building you do in an IDE is still quite painful, and I don't know that notebook-driven dashboards or notebook scheduling are really the right answer to tackle it.
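The scheduling pattern Brian describes, a notebook parametrized like a function and executed for different arguments, is exactly what the open source papermill library does. Papermill is not named in the conversation, and the file names and parameters below are hypothetical:

```python
# Execute a notebook like a function call: inject values into a cell
# tagged "parameters", run all cells, and save the executed copy (with
# outputs) as a new file. A scheduler such as cron could invoke this
# script nightly.
import papermill as pm

pm.execute_notebook(
    "train_model.ipynb",               # source notebook
    "runs/train_model_2021-12.ipynb",  # executed copy with outputs
    parameters={"start_date": "2021-12-01", "learning_rate": 0.01},
)
```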
Sam: Before we wrap up, I wanted to cover another of the announcements made at re:Invent, and that is the new Amazon SageMaker Studio Lab product that you and your team worked on. Tell us a little about Studio Lab and how it came about.

Brian: As you mentioned, Amazon SageMaker Studio Lab was launched at re:Invent this year, and its origin really comes back to the following question: what is the minimum set of things someone needs to get started with machine learning? I'm using the phrase "getting started with machine learning" very broadly here. This could be students learning about machine learning and data science in a university class; they could be learning on their own in a self-paced way; or it even extends to people who already have a good amount of machine learning expertise and are more enthusiasts, not doing machine learning in an enterprise context where they have a support staff maintaining cloud-based infrastructure for them. If you look at how people learn machine learning these days, there's a small set of things they need. One, they need notebooks, so some way of using Jupyter notebooks. Two, they need open source packages for machine learning: tools such as NumPy, pandas, TensorFlow, scikit-learn, PyTorch, et cetera. And then they need some place to run the code: compute of some sort, and storage to go along with it. Those basic ingredients are how we approached SageMaker Studio Lab. It really is an abbreviated version of SageMaker Studio that provides a notebook-based development environment, a JupyterLab-based environment, and we offer free compute and free storage along with it. The real significance here is that users don't need an AWS account. They can sign up with an email, no credit card required; there's a simple account approval step that takes a few hours. Once you have an account, you get 15 GB of persistent storage for your project, and you can attach that storage to either a CPU or a GPU runtime and do data science and machine learning in that context, all for free. Because there's persistent storage behind this, even though we obviously shut down the instance behind the scenes when you're not working, all your files will still be there when you come back. This is a real file system, persistent and allocated to your project: you can check out Git repositories locally, install Python packages persistently, and save your data sets and notebooks alongside all of those things.

Sam: From the perspective of building the product for the intended scale, how did you think about this as a product? Was it a ground-up effort, starting from components like EC2 and everything else AWS has, plus Jupyter, or did you start from SageMaker and trim back? How should we think about the way this came together?

Brian: That's a great question. When we built Amazon SageMaker Studio, which launched at re:Invent two years ago, we built a platform that enables us to build these kinds of applications, where an interactive user interface is connected to compute underneath, with a persistent file system. So that platform was already there with SageMaker Studio, and we've reused it for SageMaker Studio Lab. Now, there are some key differences between the two, and some common points; I'll start with the common points. Part of the reason we built this platform for SageMaker Studio was that when we talked to customers, they had security needs that could not be satisfied in a traditional Kubernetes environment. So the platform we're using for SageMaker Studio and Studio Lab is not based on Kubernetes; it's based on instances. Our customers have told us they need instance-level isolation from a security perspective, and a lot of the work we put in on the SageMaker Studio side, with encryption at rest, encryption in transit, VPC support, and instance-level isolation, we've been able to take and apply to SageMaker Studio Lab. So one of the hidden features of SageMaker Studio Lab is that even though it's not focused on large enterprise usage cases, it still has all the enterprise security and instance-level isolation we have on the SageMaker side. A lot of the magic of this platform is that, because it's instance-based, there's the potential for instances to take a long time to start, and we've done a lot of work, some pretty incredible stuff, to make instance start times in SageMaker Studio Lab fast enough that they don't really get in the way of most people. Obviously it can't be hundreds of milliseconds or anything like that, but I find when I'm using SageMaker Studio Lab that the start time is fast enough that I don't really think about it.

The other point where the enterprise security matters is that we wanted SageMaker Studio Lab to be a place where we could tell users: it's fine to install the AWS credentials for another, paid AWS account that you have, and make calls out through the AWS SDK or command line from SageMaker Studio Lab. If we didn't have that enterprise security in place, we could not tell users, "Great, install your AWS credentials on Studio Lab."

Sam: Does that mean there's a key store feature, or do you consider the instance-level security robust enough that I can just put my AWS keys in a cell in the Jupyter notebook?

Brian: Please don't put your credentials in a cell in a Jupyter notebook; that is definitely an anti-pattern. We regularly see customers who do that, then later forget about it and version-control the notebook, and the credentials end up on GitHub.

Sam: That's why I was asking.

Brian: It's a great question. The model is that your project gets 15 GB of persistent storage, and that storage is encrypted, so you can install your AWS credentials just as you normally would on your laptop. Because the entire drive is encrypted, and we handle it with all the necessary security precautions, you can install your credentials that way. Putting them directly in a notebook is not recommended: not because it's any less encrypted, but because you're more likely to put the notebook someplace it shouldn't be. If you never move a notebook out of SageMaker Studio Lab, it will be saved on that same encrypted volume and it will be fine; the risk is more that users would later do something with the notebook that exposes those credentials outside their original context.
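A minimal sketch of the pattern Brian recommends: keep credentials in the standard ~/.aws/credentials file on the encrypted project volume, exactly as on a laptop, and reference them by profile name so no secret ever appears in a cell. The profile name below is hypothetical:

```python
# Credentials live in ~/.aws/credentials, not in the notebook, so
# committing or sharing the notebook can never leak them.
import boto3

session = boto3.Session(profile_name="my-paid-account")  # hypothetical profile
s3 = session.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```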
Sam: Maybe just one more, to wrap things up. Where do you see all this going? What are you most excited about in terms of the future of ML development and human-computer interfaces for machine learning?

Brian: I'll answer this personally. For me, I thrive on ambiguity and challenge. I want to be working on things where the answer is not known, or not obvious, or where there are significant challenges involved in coming to an answer. When I look at this space, the amount of challenge and ambiguity we have left is vast, so I find a lot of enjoyment in working where there are so many unanswered questions. Some of this comes from my background in physics. I love physics, and at the end of the day I'm probably still a physicist. One of the challenges in physics, though, is that as the field has grown, and more and more is known, there are fewer and fewer really deep and interesting unanswered questions available. There are some, obviously, but you really have to hunt; my sense as a physicist is that you have to work really hard in physics to find good problems to solve. Whereas in this space, there's an abundance of problems, they're all really challenging and really interesting, and there are a lot of people who will benefit from solving them. That's what I'm excited about, looking forward in this space.

Sam: Awesome. Brian, thanks so much for joining us and sharing a bit about what you've been working on, and some of the new SageMaker and Studio announcements coming out of re:Invent.

Brian: Thank you so much, Sam. It's really great to be here and talk with you about all these things, and I appreciate your time and your questions about the early history of Jupyter, which is a really fun story to tell.
Info
Channel: The TWIML AI Podcast with Sam Charrington
Views: 122
Keywords: TWiML & AI, Podcast, Tech, Technology, ML, AI, Machine Learning, Artificial Intelligence, Sam Charrington, data, science, computer science, deep learning
Id: 2N18paoYOMg
Length: 61min 56sec (3716 seconds)
Published: Mon Dec 13 2021