Data + AI Summit 2021 - Full Thursday AM Keynote on Apache Spark, Data Science + Machine Learning

Captions
Hello, and welcome back to Data + AI Summit 2021. I hope you've enjoyed all of yesterday's content and sessions and that you're ready for another full day. We've got a lot for you today, but first a few quick reminders. First, get social as we go today: be sure to use the hashtag #DataAISummit and share your thoughts and feedback with the rest of the community, and check out the Data + AI Summit's social channels for the latest summit news and activities. Second, please find some time to visit our sponsors in the Dev Hub and Expo; there are lots of cool demos, prizes, and different ways to engage with them. Third, if you feel inspired to do so, please donate to any of the three causes we're sponsoring here at the summit this year. As a reminder, Databricks is matching all donations up to fifty thousand dollars, which means that together we can give up to two hundred thousand dollars. We made good progress yesterday but still have a long way to go, so I hope you can all take time to learn more about these organizations and donate.

Yesterday we talked about our theme, our shared belief that the future is open, and we talked about how this community is at the heart of the open source revolution of data and AI. Today the projects that you see here get over 30 million downloads every month. These technologies are essential to building lakehouses, which leverage open source, open formats, and open standards to unify data types, tools, and workloads. In yesterday's talks we focused heavily on the middle layer of the lakehouse, Delta Lake, which brings data management and governance to enable any downstream use case on top of your existing data lake. Today we're going to focus a lot more on the bottom and top layers, starting with Apache Spark. Spark has always been the heart and soul of this conference, and the community we've built together around it remains critical to the open future we're talking about here today. The community remains super vibrant and passionate, and there's a lot of exciting innovation happening around Apache Spark. To talk about it, I'm excited to pass it over to the all-time top contributor to Apache Spark, Reynold Xin.

Thanks, Ali. Good morning, good afternoon, and good evening. I hope you've enjoyed Data + AI Summit thus far. This conference started as Spark Summit, and those roots are very important to me personally and to the Apache Spark community, even though we've expanded to cover a lot of other topics and technologies. I want to talk to you today about the growth of Spark, its shift in usage, and in particular making Apache Spark better for data scientists. Today we're seeing Spark used more and more as the engine to power the lakehouse architecture, which is about combining very different types of workloads and use cases, from ETL to BI to data science to machine learning. But this was not always the case: in 2013, virtually everybody was using Spark's Scala API. Over the years we've done a lot of work to make Spark more accessible to data teams, including improving the SQL and Python APIs. Last week we looked at the number of commands run on the Databricks platform, and it's astonishing when compared with 2013. While Scala is for sure still a first-class API and powers many of the most important compute-intensive data engineering jobs, you see the rapid rise of Python and SQL, each claiming almost half of all the commands. This is really evidence for the lakehouse vision, because data scientists are using Python while a lot of analysts are using SQL.
The development of the Spark project also reflects this trend: we're investing heavily in SQL, in particular when it comes to ANSI SQL compliance, and in Python. I don't have time to go into detail about all of the changes — we would have to spend hours here — so instead today I want to focus on data science and Python.

Most data science workloads are done on laptops today, and the dominant programming language is Python. Due to resource constraints on these laptops (or sometimes just beefier servers), and even the way the libraries are designed, they can usually only handle hundreds of megabytes of data, occasionally going up to gigabytes. So data scientists often work with a downsampled data set instead of the entire data set. Of course, some data scientists also use Spark to process terabytes or even petabytes of data; however, Spark is a very different tool from what most data scientists know, and the gap between laptop data science and distributed computation can be very large. Two years ago we came up with the idea of marrying both worlds: the single-node Python data science world and distributed computation on big data. We wanted to make it easy for data scientists accustomed to programming on a single node, like a laptop, to work in a distributed Spark environment. As all of you are probably aware, the single most important library for data science is pandas; if you take a data science 101 class, chances are it will teach you pandas. So we came up with the idea of scaling the pandas API using Spark and created the experimental library called Koalas.

Koalas, announced about two years ago, enables data scientists to easily port their pandas code base over. With one line of import change, their pandas code can now execute at scale. There's no need to learn a new API — you can even search Stack Overflow, find snippets of code that others have shared, and use those to process much larger amounts of data. In the past two years we've been really humbled by the reception and adoption of Koalas: today we see three million downloads of Koalas on PyPI, the Python Package Index, and many of our customers have told us Koalas fundamentally changed the way they work. Just like data engineers used to tell us that Spark changed the way they build data pipelines, data scientists now tell us that Koalas has changed the way they do data science at scale. Today we're pretty excited to announce that Databricks is donating the Koalas project into upstream Apache Spark. We'll work with the rest of the community to merge Koalas in, so any time you write code for Spark, you know the pandas API is available at your disposal.

There are two big advantages of running your pandas code using Spark: scalability and performance. Even on a single node, for larger amounts of data like the one shown on screen (31 gigabytes), pandas on Spark can perform better than pandas itself thanks to multi-threaded execution — pandas runs everything in a single thread. Sometimes pandas performs better because of its lower overhead, but once you get to larger amounts of data, pandas won't be able to handle it: often you get an out-of-memory error, or if it does run, it might be slow due to its single-threaded nature. Pandas on Spark, on the other hand, can leverage Spark's computation engine to handle large data sets. Even on a single node it can gracefully handle data much larger than memory using external operations, and of course if you add more compute resources it can scale linearly to reduce runtime, all using the same pandas API.
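To make the "one line of import change" idea concrete, here is a minimal sketch; the file path and column names are placeholders, not from the talk.

```python
import pyspark.pandas as ps   # previously: import pandas as pd

df = ps.read_csv("/data/listings.csv")            # returns a pandas-on-Spark DataFrame
df = df[df["price"] > 0]                          # the familiar pandas filtering syntax
print(df.groupby("bedrooms")["price"].mean())     # executed by Spark under the hood
```

Everything after the import looks like ordinary pandas, but each operation is planned and executed by Spark, so the same script can work on data that would never fit in a laptop's memory.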
One other thing that's really cool about merging Koalas into Spark is that Spark users, even without using the pandas API, can get visualization capabilities out of the box. With just a few lines of code they can plot beautiful charts for their data, big and small, and the best part is that users don't even need to downsample their data. The pandas API on Spark implements efficient plotting techniques based on the type of plot you're using. If you're plotting a histogram, KDE, or box plot, it computes the data to plot using PySpark APIs under the hood, such as Bucketizer, kernel density estimation, and approximate percentile; it then passes the result to a plotting library such as Plotly to draw the plot. You can finally plot your entire data set without needing to downsample it first or running into out-of-memory errors. Plots such as pie charts, bar charts, or scatter plots are implemented by taking the top N records: the pandas API on Spark selects the top N using limit and head and passes the result to the plotting library. If you're plotting an area chart or line chart, it runs a uniform sample across all the data and draws the plot using the plotting library.

So now, with Koalas merged into Spark, Spark will expose many different APIs for different use cases and personas, and what's really cool is that all of them leverage the same underlying engine and go through the same optimization pipeline, so you get more or less the same performance regardless of which one you use. We hope you all find that the addition of the pandas API to Spark makes your move from single-node Python data science to big data simple.

In the second half of the talk, I want to talk to you about a general area of work to make Spark more Pythonic, as part of Project Zen. Project Zen is named after the Zen of Python, the set of guiding principles by Tim Peters introduced in 2004.
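A rough sketch of what that plotting looks like in practice, assuming a hypothetical listings data set; with the default Plotly backend, only the aggregated plot data leaves the cluster.

```python
import pyspark.pandas as ps

psdf = ps.read_csv("/data/listings.csv")        # placeholder path

# Histogram: bin counts are computed distributedly (Bucketizer-style bucketing
# under the hood); only the per-bin counts are handed to Plotly for rendering.
psdf["price"].plot.hist(bins=50).show()

# Bar-style plots take the top-N records; line/area plots use a uniform sample.
psdf.groupby("bedrooms")["price"].mean().plot.bar().show()
```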
Clarity is very important for a great developer experience, so here are the four principles that guide us to increase clarity: errors should never pass silently; readability counts; explicit is better than implicit; and simple is better than complex. If you have used PySpark, you probably know that PySpark's error messages are not the easiest to read. As a matter of fact, there's a tweet by Randy from 2015 pointing out that 30 characters of code can give you 97 lines of error messages. We have heard you, Randy. Back when some of those tweets were written, we had pages upon pages of error messages, typically intermixing Python stack traces and Java stack traces. We started improving error messages in Spark 3.1, making the origin of the Python error much clearer, and at Data + AI Summit 2020 we talked about cutting six pages of error messages down to just one in Spark 3.1. But we didn't stop there: we have now made these error messages shorter by another 50%, and the changes in development will be available in the next release of Spark.

Being concise isn't the only goal. Another important part of error message design is to make it easy for users to find the underlying causes. The typical workflow for most data scientists or engineers is to search the web or Stack Overflow to figure out what's going on with their program, and to make that specific workflow easy we're introducing error codes in error messages. This will make it much easier to locate and capture the root cause of errors, and this change will also be available in the next release of Spark. While runtime errors are helpful, it would be even better if we could catch errors and mistakes at the time of writing code, so we've introduced type hints throughout Spark's code base. That results in much better error messages in both IDEs and notebooks using static analysis, as shown on the screen here. In addition, type hints also act as an explicit contract for large code bases. This change ships as part of Spark 3.1.

Now let's move on to autocomplete. I personally rely on autocomplete in IDEs and notebooks every day, because I can't memorize all the different APIs. Before Project Zen, autocomplete for PySpark left a lot to be desired: it lacked context and couldn't show the most important things. On the screen here, I most likely wanted to look at the various CSV parsing options, but the suggestions are not at all useful. With Project Zen we're shipping a much better autocomplete that's context-aware and relevant, and it works both in IDEs and in notebooks. Finally, all this great Python code you're writing often requires dependent libraries to be installed or configured, and this is uniquely challenging in a distributed environment where some nodes have a library available while others don't. We've now improved our integration with Conda, virtualenv, and PEX to ship and manage Python dependencies, so you can manage all the dependencies of your program on a cluster using the same set of tools you would use on a single node.
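As a concrete illustration of that last point, here is a minimal sketch of shipping a packed Conda environment with a Spark 3.1+ application; it assumes you have already created the archive locally (for example with `conda pack -o pyspark_env.tar.gz`), and the file name is a placeholder.

```python
import os
from pyspark.sql import SparkSession

# Tell the workers to use the Python interpreter inside the unpacked archive.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    # Spark distributes and unpacks the archive on every node as ./environment
    # (use spark.yarn.dist.archives instead when running on YARN).
    .config("spark.archives", "pyspark_env.tar.gz#environment")
    .getOrCreate()
)
```

The same pattern works with virtualenv or PEX archives; the point is that the environment travels with the job instead of having to be pre-installed on every node.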
Now I'm going to pass it over to Brooke Wenig, a data scientist at Databricks who uses pandas every day. She's going to show us a demo of some of these new changes working together.

Thanks, Reynold. This past year Hawaii has become an even more popular vacation destination, and for this demo I thought it would be fun to analyze some Airbnb listings for Hawaii so that we can plan our vacation too. Plus, who doesn't love pandas on a surfboard? I started my foray into data science with pandas: I love the ease of the API, the documentation, the developer community — everything about it. As you can see, I've already written some pandas code here. I'm going to load in a data set, filter out records where the price is either free or above ten thousand dollars (because yes, those show up in the data set), and then select just a subset of the columns. Now let's take a look at our data: we can see the number of bedrooms, the listing, the name, and so on.

But now I want a solution that scales — maybe not just for Hawaii, but for all Airbnbs across the globe. To do that, all I need to change is one import: instead of importing pandas, I'll import pyspark.pandas, and I'll change the alias as well. So now we have a solution that scales without requiring any code change apart from an import, and as you can see, we get the same results. By merging this pandas API on Spark into the core Apache Spark project, Spark finally has a solution for plotting data at scale: I no longer need to downsample my data or collect it back to the driver. Based on the type of plot I'm using, the pandas API on Spark will figure out the optimal way to execute it. So let's take a look at the average price by number of bedrooms. You can see that the default backend is Plotly, a visualization tool many of you already know and love, and we can see that as the number of bedrooms increases, the price also generally increases — but there is a steal of an Airbnb with 11 bedrooms for $900.

You're not locked into just using the pandas syntax, either. We can always call to_spark() on our DataFrame and get the underlying PySpark DataFrame. This effectively drops all of the metadata, so there's no performance hit in going back to the underlying PySpark DataFrame. We can then use the PySpark API to ask questions such as: what is the most popular property type available for rent? We can see the top three are condos, houses, and apartments. If we were to do that using the pandas syntax, you'll see it's actually quite different — the end result is the same, but the syntax is quite different — and this is what the pandas API on Spark seeks to unify: one API that scales as your data scales, so you don't need to spend time converting code over to another syntax or accidentally introduce errors in the process.

And lastly, the moment you've all been waiting for: which Airbnb should we rent? While I could write this code in Python to answer that question, I actually find it a little more intuitive to write it in SQL. Here you'll notice I'm using ps.sql, quite similar to spark.sql, but it has an added benefit: whenever I work with a pandas-API-on-Spark DataFrame, it's automatically registered inside the SQL context. That means I no longer have to create a view or a table to be able to query it in SQL; I can just natively access it by passing it inside these curly braces. So I'm going to select all of the columns, but filter for the neighborhood of Lanai, I'd like the host to be a superhost, the price should be less than $400, and it must be sparkling clean. It looks like we're recommended to stay at the Artist's House. I don't know about you, but I'm interested in taking a look at some photos of this, so let's pull up the listing URL. Here we can see the Artist's House: it accommodates six guests, two bedrooms — overall it seems like a great place to stay. Mahalo. Thank you, everyone.
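A condensed sketch of the two interoperability tricks from the demo — dropping down to the PySpark API and querying a pandas-on-Spark DataFrame from SQL. The path and column names (property_type, price) are stand-ins for the real listings data set.

```python
import pyspark.pandas as ps

psdf = ps.read_csv("/data/hawaii_listings.csv")      # placeholder path

# Switch to the underlying PySpark DataFrame when that API is more convenient.
sdf = psdf.to_spark()
sdf.groupBy("property_type").count().orderBy("count", ascending=False).show(3)

# ps.sql can reference pandas-on-Spark DataFrames directly inside curly braces,
# with no temp view needed (the exact binding syntax varies a bit by release).
picks = ps.sql("SELECT * FROM {psdf} WHERE price < 400")
```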
Now I'd like to pass it back to Reynold — but before I do that, I think you need your very own surfing panda. Reynold, surf's up!

Thank you, Brooke, for the demo, and of course for the surfing panda from Hawaii. Just to wrap up: the core focus of the Spark project today is enabling the lakehouse architecture, creating a single platform for data engineering, SQL, BI, data science, and machine learning. In this talk we focused on two parts. The first is the merging of the Koalas project into Spark, so data scientists can use a single pandas API to perform analysis on both small data on a laptop and large data on a cluster. In addition, we have done a lot of work on usability to make Spark a lot more Pythonic, with dramatically shorter and clearer error messages, contextual autocomplete, and type hints. These changes really marry single-node data science and big data, making Spark an even better tool for data scientists. In reality there are a lot of exciting changes everywhere that we didn't have time to get into today — please keep an eye out for them in this conference, in other talks in the future, and in the blog posts and release notes. Thank you very much. I'd like to pass it back to you, Ali.

Thank you, Reynold and Brooke. I'm really excited about the great work of Project Zen as well as the integration of Koalas into Spark; it's going to make it a lot easier to do data science at scale. When it comes to data science, one of the biggest areas of interest today is machine learning, and the MLflow project has had a huge impact here, becoming the most widely adopted open source machine learning platform in the world. Here to tell us more about how this project has evolved is Matei Zaharia, Chief Technologist at Databricks.

All right, I'm really excited to give an update on MLflow, the open machine learning platform that we launched as an open source project at this conference three years ago. In 2011, Marc Andreessen said that software is eating the world. He was observing that in more and more industries, software had become essential to performing well in that industry — it was one of the key differentiating factors between companies and products. More recently, a lot of folks have been saying that AI is actually eating software, and the reason is that AI has started to play really important roles in large classes of software, or even to replace previous systems that involved a lot of custom business logic, and to do quite a bit better. Just as some examples: recently OpenAI used a variant of GPT-3 to generate images from text — you can imagine some of the commercial applications of that. Google has been using AI for medical diagnosis, for example measuring hemoglobin levels in a patient just from retina scans, without an invasive procedure. And OpenAI has been revolutionizing the way we can train robots to perform various tasks, so that instead of specifying fine-grained rules for how the robot will move, it can just learn how to do specific tasks efficiently. So AI has obviously impacted all aspects of business, but today it's only done that for a select few companies — the Facebooks, Googles, Netflixes, and Ubers of the world — who can run these massive engineering and development programs and build these kinds of applications. So the natural question is: what about the rest of us?
Interestingly, when you look at the massive tech companies that are successful using AI today, one of the reasons they're successful is that they've made massive investments in the underlying platforms. It's actually not so much the algorithmic insights they have — someone comes up with a new algorithm — it's the infrastructure to take these algorithms and turn them into a reliable product that you can launch and that can make important business decisions for your company in real time. All of these companies have built what are called machine learning platforms: infrastructure that allows their engineers and data scientists to build, maintain, and operate machine learning applications all the way from data to production. They have teams of hundreds of people working on just this infrastructure, which then allows other product teams to build the hundreds or maybe thousands of internal ML applications these businesses have. Just as some examples: Google open-sourced TensorFlow, but they have a whole platform called TensorFlow Extended (TFX) that teams use internally to develop applications; Uber has a platform called Michelangelo that supports hundreds of applications; Facebook has FBLearner; and so on. These are the engines that power productionizing machine learning and having hundreds of use cases on it.

When we started looking at this domain at Databricks three years ago, every company's ML platform was very customized, internal, and bespoke to their architecture. We thought there had to be a better way, so we asked: can we design an open ML platform that a wide range of organizations can contribute to, and make it shared infrastructure as opposed to something you have to build yourself? That's what we did with MLflow, the open source machine learning platform. MLflow is an open source project at the Linux Foundation, and it provides four key capabilities that you need to productionize machine learning: components for tracking, or monitoring how your application is doing over time; for reproducible runs of your application; for packaging and deploying models; and for reviewing and centrally sharing models through the Model Registry component. These components are designed so that all of the users involved in different aspects of the machine learning lifecycle can easily collaborate on these applications, make sure they're working correctly, and troubleshoot any problems that come up. But MLflow is also open in another really critical way: it's designed to have an open interface, so you can easily plug it into your favorite machine learning systems, internal systems you have, or vendor products that live alongside that lifecycle. It can be used in any programming language and with any machine learning library, and it connects automatically to many popular commercial services, allowing you to switch between them when you deploy your application and easily use best-of-breed tools to manage it.

If you're not familiar with MLflow, I'll just highlight a few of the things it can do. The first way most users begin using it is through the tracking component, which allows you to track experiments while you're developing your model and then track production runs of it once you finish developing it, to make sure it continues doing well over time.
For a lot of machine learning libraries today, the community has created connectors to make tracking with MLflow super easy. For example, here I'm showing some code to train a model in Keras. If you want to start instrumenting that process with MLflow and keep track of all the information in real time, you just need to call the mlflow.keras autolog function, and it will automatically instrument Keras and capture all the relevant information about the model you're training. That gives you, in MLflow, this nice graphical interface where you can see all the times you've trained the model before, compare it over time, experiment with different parameters, and also see which versions of your code were used, so you can immediately start structuring your development process and making it much easier to troubleshoot issues.

That's just experiment tracking. Another key aspect of MLflow is the Model Registry. This is a collaborative environment, a little bit like GitHub, where you can register all your models, keep track of multiple versions of them, and move your models through a software development lifecycle where they pass through stages such as testing and production, and different users or automated systems can approve changes to them before you actually deploy them. This is another key element of the platform that lets you centrally manage these models and make sure the right one is used in each application, the same way you would centrally manage different versions of software.

Finally, MLflow has a lot of built-in connectors from the community to different ways of deploying the model. One of the really powerful features is that whatever library you use to develop the model, once you put it in the registry it's in the standard packaging format that we call the MLflow Model format, and there are all these tools where you can easily deploy the same model without having to know how it was trained, which software version, which libraries, and so on. For example, you can take the model and put it into any number of open source or commercial online serving tools, you can do batch inference on massive data sets using Apache Spark, you can load the model into code and call into it there, or you can push it to edge devices. So you get this whole ecosystem where you don't have to write something special to get your model into these different deployment modes.
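To illustrate that "deploy without knowing how it was trained" point, here is a minimal sketch of loading a registered model by its registry URI and reusing the same artifact for batch scoring with Spark; the model name, table, and column names are hypothetical.

```python
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load whatever is currently in the "Production" stage of the registry,
# regardless of which library originally produced the model.
model_uri = "models:/churn_model/Production"      # hypothetical registered name
model = mlflow.pyfunc.load_model(model_uri)

# The same artifact can drive batch inference at scale as a Spark UDF.
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri)
scored = (spark.table("customer_features")        # hypothetical feature table
               .withColumn("prediction", predict_udf("age", "tenure", "plan")))
```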
We're very excited about the continued momentum of the project. The community has grown a huge amount — you can see in this plot that over the past year we've grown to over 5 million monthly downloads of MLflow on PyPI and over 300 contributors to the project in the three years since we launched it. On Databricks, with the hosted version we provide, we've seen tremendous growth as well: today we have thousands of customers using MLflow on the platform, and they're creating more than one and a half million experiment runs per week that produce new models they then work with. So there's a huge amount of production machine learning happening through this platform.

So what's new in MLflow? Over the past year there's been a lot of activity in the community, and I just want to highlight some of the new things. One of the big efforts has been to make it easier to integrate MLflow into an existing application, and we've done that by adding autologging packages for a lot of popular libraries, like I showed before with Keras. We're making this even easier now by adding a single mlflow.autolog API, which automatically detects which ML libraries you're using and enables logging for all of them. So it's very straightforward: if you have an existing piece of code, you can add this one line and start capturing detailed information about the training data, the code, the parameters you used, and of course the models themselves, and begin operating on that information.

We've also added a whole bunch of new features to tracking. One of these is built-in support for model interpretability using the popular SHAP library, so you can now log explanations of your model and get these graphics that explain exactly why the model is making specific predictions. Obviously this is super useful when you're reviewing a model before deciding whether to deploy a new version, to quickly understand why it's making its predictions and what's different from the previous one. You can also log arbitrary objects now, like figures, text files, and images, so you can customize the information you're logging in MLflow. And we have an MLflow thin client package with far fewer runtime dependencies, which you can use if you want a lightweight way to explore the data you have in MLflow from an application without pulling in all of its dependencies. These all make it easier to work with.

On the deployment side, there's also been a lot of activity on deployment backends. There were quite a few supported in the project already, and in the past year we've added Algorithmia and PyTorch TorchServe as ways you can deploy your models, so once you save a model you have a huge variety of options for how to use it, and it's just one line of code, regardless of your model type, to get it into any of these serving systems.

One thing in the community that I'm really excited about is the integration between MLflow and PyCaret. MLflow is now a key part of the PyCaret machine learning library. If you're not familiar with PyCaret, it's a popular low-code machine learning library for Python: with just a few lines of code you can load in some data, it can understand the features in the data, and it can try a variety of algorithms and give you what is now a deployable and manageable model using MLflow. It's a very popular library for just getting started with machine learning. The way it works: this is basically a complete PyCaret application that trains a model on some insurance data, and in case you haven't noticed, this code in here is what enables MLflow. If you're using PyCaret, when you call the setup function you can just provide an experiment name on your MLflow server, and it will automatically log all that information into MLflow — you don't need to do anything else to get it. So you can run very little code to experiment with lots of models in PyCaret, and if you add these parameters, you automatically get this information in the MLflow UI, where you can then manage your models, share them with users, review them, and so on.
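A hedged sketch of that PyCaret integration, assuming PyCaret 2.x and its bundled sample insurance data set; the experiment name is arbitrary, and the setup arguments shown are how experiment logging is documented to be enabled, but check your PyCaret version's docs.

```python
from pycaret.datasets import get_data
from pycaret.regression import setup, compare_models

data = get_data("insurance")                 # sample data set that ships with PyCaret

# Turning on log_experiment makes every candidate model an MLflow run.
exp = setup(
    data,
    target="charges",
    log_experiment=True,
    experiment_name="insurance_keynote",     # arbitrary experiment name
    silent=True,
)

best = compare_models()                      # trains and compares many models
```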
There's actually a whole talk about this at the summit — I invite you to check it out if you want to see how to connect MLflow to low-code, automated machine learning.

Finally, we have quite a bit of exciting work on the roadmap ahead for MLflow this year, and of course we'd love to get your feedback as we develop these things in the open source Slack channel. One thing we're continuing to push on is the integration with the PyTorch ecosystem. PyTorch has a wide variety of really powerful tools for model explanation, interpretability, profiling, and so on, and these are two of the things we're working on today. First, we're integrating with the Captum library, a model interpretability library powered by PyTorch, and we'll be integrating those visualizations directly into the MLflow UI. If you haven't seen Captum, it makes it very easy to interpret the predictions of a wide range of models using the latest and greatest methods: you can use it for computer vision to see what objects are being detected, for text to explain the predictions of language models, or of course for simple models on numerical features. You'll be able to take all of these things, log them into MLflow, and compare them across your experiments over time. We're also integrating with the PyTorch Profiler for debugging the performance of your deep learning models.

Another major area we're starting on is model evaluation. You've logged all these models in MLflow, maybe you've deployed them — how do you now evaluate how they're doing? There are two projects there. The first is offline evaluation: let's say you've logged a whole bunch of models and you want to compare them, maybe on a specific slice of the data or a new test set you've collected; we're going to have APIs that make that very easy. The other is online evaluation: when you deploy your model, we'll make it easy to also deploy code that captures and saves the predictions and the inputs that went into the model in a standard format like Delta Lake, so you can get these streaming predictions, start evaluating the online predictions, and see how they compare to the data you trained the model on.

The third thing we're doing is integration with production job systems. For example, in Databricks we're integrating MLflow tracking directly into the user interface of our job scheduler, so you can see the metrics, artifacts, and so on produced by MLflow directly there, and this is also something we'd like to do with other job schedulers in the open source project. Here is a sneak peek at how it will look. This is the jobs interface in Databricks: you can see we've run this job a bunch of times in the past few days, and you can see the custom MLflow metrics for this job directly in the interface. So I can quickly eyeball this and see whether my job is starting to do poorly from an ML perspective, even if it successfully completed and ran on time. In fact, we can see here that one time I had too many null records in the data, which is something I was able to track, and it's surfaced right here in the UI so I could quickly see that and fix the model. And the fourth thing we're doing is more work on the open source MLflow server to make it easy to customize it.
For example, we're working on a plug-in API so you can have webhooks or custom logic every time an event happens in MLflow, and you can easily connect it into your organization's machine learning lifecycle. So there's an exciting roadmap ahead for MLflow, and we'd love to get your feedback on it. The final thing I wanted to mention is that we also think we're getting close enough, in terms of experience with the project, to launch MLflow 2.0. This is a chance to improve a lot of the core APIs and data models and maybe add new capabilities, so we've launched a survey where we'd love to get your feedback on machine learning infrastructure in general, and in particular on what you'd like to see in MLflow 2.0. Make sure you check that out — whether you're a current user or not, we'd love to hear what would make it easier for you to build an ML platform at your company and reliably deploy and maintain machine learning applications.

Next up, I'm really excited to introduce Clemens Mewald, the head of product for data science and machine learning at Databricks, who'll talk to you about some of the new machine learning capabilities we've added to our platform, many of which integrate tightly with MLflow. Go ahead, Clemens, I'm passing the baton to you.

Well, thank you, Matei — looks like you're reeling in a big one. I'm very excited to talk about machine learning on Databricks today. As mentioned, just as software has transformed many businesses and created new ones, AI will do the same. By definition, AI has been able to achieve tasks that previously wouldn't have been possible by manually writing software. However, we all know that most of the time spent in developing AI is focused on data, and that without good quality data, AI just doesn't work. So what is really happening is that data is the bigger beast that is eating AI, and while that may just sound like a catchphrase, there's actually a lot of truth to it. Let's take a look at why the focus is really on data.

The main reason is that we've become really good at dealing with code and data separately, but we're terrible at combining them. In software engineering the main goal is functional correctness, and in most cases you can write good tests to ensure that. AI, on the other hand, tries to optimize a metric, which can be a moving target with changing data. Measuring quality in software is also easier because it deterministically depends on code; the quality of a machine learning model, however, depends not only on code but also on the data, the model architecture, and the hyperparameters, and in some cases it can quite literally be random, based on the random initialization of parameters. Finally, the outcome of software is deterministic, whereas in AI the outcome of training a model can change significantly based on changes in the underlying data. All of these factors combined make it painfully clear that AI is hard because it depends neither on code nor on data alone, but on the combination of both.

As a result, many different people need to get involved. I'm oversimplifying, of course, but software is mainly built by software engineers; to train a machine learning model and apply it in production, on the other hand, you usually involve some combination of software engineers, data scientists, and data engineers. Getting all of these people involved and coordinating among them is a major challenge. Now, this is not uncommon — some of the most meaningful problems can only be solved by bringing many people together.
However, it becomes almost impossible if those people cannot agree on the tools they should use. In software development, a team usually standardizes on a stack — which language, code versioning, or CI/CD system to use — and these tools have matured over time. In AI, I have seen many teams where different individuals prefer different machine learning frameworks, use different IDEs, and most certainly don't follow common CI/CD practices. In fact, this overview of the data and AI landscape by Matt Turck paints a perfect picture. How you feel about this situation likely depends on who you are. If you are a VC or a researcher, you look at this landscape and get excited: this is clearly a thriving ecosystem of innovation, with lots of opportunity for investments and PhD theses. However, if you're an enterprise architect responsible for making purchasing decisions, or a tech lead on a software engineering team, you probably think this is a procurement and DevOps nightmare: you'll have to look at each one of these boxes, pick the right tool, and then spend lots of time and energy stitching them together. So the tooling is really hard, because AI requires integrating many different components and the space is moving at a breakneck pace.

In summary, we are faced with three challenges: AI is hard because of the interdependency of code and data, because many people need to get involved, and because a massive number of components need to be integrated. So what are the attributes of a solution? First and foremost, since data is eating AI, the solution must be what we call data-native. This is already where many machine learning platforms fail, because they naively assume that the data problem is already taken care of. Secondly, data teams can really only be successful if they work together, so a solution had better be collaborative and open to all members of a data team. And finally, the solution must integrate all components to satisfy the full machine learning lifecycle, and because the ecosystem moves at such a breakneck pace, it had better follow open source standards. Many solutions out there fail to meet these requirements because they either disregard data engineering entirely, narrowly focus on only one persona within a data team, or don't solve for the full machine learning lifecycle.

Now, this would be a terrible keynote if I didn't have good news for you. If you know Databricks, you know that these attributes are at the core of what we do every day, and this has been a very long time coming. I'm extremely excited to announce the launch of a new product: Databricks Machine Learning. Databricks Machine Learning is a data-native and collaborative solution for the full machine learning lifecycle. It provides all of the capabilities needed for data teams to successfully train and apply machine learning models and focus on their business needs, as opposed to playing systems integrator.

So let's start with first things first: what do we mean by data-native? As indicated at the bottom of this slide, Databricks Machine Learning is built on top of an open data lakehouse foundation. If you think about it, the first thing you should care about when picking a solution for machine learning is whether you can access any type of data, at any scale, from any source — which includes access to data across clouds. With Delta Lake you can do just that: it doesn't matter whether you have images, video, or tabular data, or whether those come in small files like CSVs or in terabytes of streaming data from IoT sensors. Delta Lake gives you consistent and high-performance access to all of your data.
So what does this mean for machine learning specifically? When you train a model with the Machine Learning Runtime, you benefit from optimized writes and reads to and from Delta, and because Delta provides ACID transaction guarantees, you can trust that your data is available at high quality and consistency. Delta also provides built-in data versioning through its time travel feature, and through an integration with MLflow we automatically track exactly which version of your data you used when you trained a specific model. That gives you full lineage from data to model and comes in handy when you think about reproducibility.

Now that we've established what data-native means, let's look at how the data science workspace makes Databricks Machine Learning collaborative. The data science workspace facilitates collaboration by supporting multi-language notebooks: you can use Scala, SQL, Python, or R, all within the same notebook. The power of multi-language support really comes together once you collaborate with others, so the notebooks also provide cloud-native collaboration features like co-presence, co-editing, and commenting. Finally, and most importantly for machine learning, the workspace integrates tightly with MLflow — here you can see the experiment sidebar that shows tracked parameters, metrics, and models right within the context of your work.

At the core of Databricks Machine Learning, of course, are all of the components that facilitate machine learning from data to deployment and back. We start with data prep. As mentioned earlier, Databricks Machine Learning, being a data-native solution built on a lakehouse, supports any type of data at any scale — and there's more to this story than just being able to process any kind of data; later in this talk I have a surprise announcement that addresses machine-learning-specific needs for data prep. Once your data is in shape, you can train models using any machine learning library you want: we provide a runtime that is optimized for machine learning and packages up the most popular ML libraries like scikit-learn and TensorFlow, as well as popular tools for distributed training, hyperparameter tuning, and model interpretability. For model deployment we support any and all deployment modes, from batch to online scoring, on the platform of your choice. Of course, as you know, Databricks is multi-cloud, so you can train your model in one cloud and apply it in another, or even on-prem or on edge devices.

Getting to the foundation of machine learning, what is often referred to as MLOps: we found that true MLOps is really a combination of DataOps, DevOps, and ModelOps, and on Databricks we combine all of these aspects, with data versioning using Delta Lake, code versioning using our own new Repos feature, and model lifecycle management with the MLflow Model Registry. But that's very abstract — let's see what it looks like when you get MLOps right, and I like to explain this from right to left. Let's say you have a model applied in production: you should be able to find out where it came from, and the Model Registry shows you the entire history and approval workflow the model went through. From there you may want to know the offline metrics and who trained the model; from the Model Registry you can go back to the experiment that initially logged the model and find all of the metrics, parameters, and artifacts. And if that experiment was created on Databricks, we captured all of the environment information from your workspace as well.
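On Databricks this data-to-model lineage is captured automatically, but the underlying idea is easy to picture. Here is a minimal sketch, with a placeholder Delta path and version, of pinning a Delta snapshot with time travel and recording it on the MLflow run by hand.

```python
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data_path = "/delta/training_features"     # placeholder table location
data_version = 12                          # placeholder Delta snapshot version

# Read an exact, reproducible snapshot of the training data via time travel.
train_df = (spark.read.format("delta")
                 .option("versionAsOf", data_version)
                 .load(data_path))

mlflow.autolog()                           # capture params/metrics/model from the trainer

with mlflow.start_run():
    # Record which data snapshot produced this model, for lineage/reproducibility.
    mlflow.log_param("data_path", data_path)
    mlflow.log_param("data_version", data_version)
    # ... train the model on train_df here ...
```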
You can probably see where this is going: if you know the exact code you used, the hardware configuration, the library versions you installed, and the exact version of your data, we can give you a single button to reproduce the model — and we do just that. This is really how you know that you've gotten MLOps right. So, in summary, Databricks Machine Learning is a data-native and collaborative platform for the full machine learning lifecycle.

But that's not all. As mentioned earlier, I have one or two surprise announcements for you. I'm extremely excited to announce a brand new addition to Databricks Machine Learning: a Feature Store. Our Feature Store is the first that was co-designed with a data and MLOps platform, and I'll walk you through what that means in a minute during the deep dive. And if that wasn't enough, we're also announcing today the release of our own native AutoML capability. Databricks AutoML takes a unique glass-box approach to AutoML that empowers data teams without taking away control. Our customers love this approach, and Kasey will walk you through this new product later. But let's jump into the Feature Store first.

I'm extremely excited to talk to you about our Feature Store. It is the first feature store that has been co-designed with the data and MLOps platform, which provides several differentiated capabilities. Let's start by talking about what we mean by a feature, using a recommendation system as an example. Your raw data may look something like this: a users table that has attributes of registered users, an items table that stores information about the items you're selling, and a purchases table that has a record of all the items that users bought. Where you want to get to is a list of items ranked by the likelihood of a user purchasing them. How do we get there? With a machine learning model, of course. The inputs to a machine learning model are referred to as features, and the output is the prediction.

So what features can we think of? The most basic features are simple transformations: say you have a product category and you want to encode that category as an integer when you feed it into the model. Context features are features that change with the context in which the prediction is made — that can be something as simple as the day of the week, or what device the user is on when they access your product. In some cases you may want to augment your data: one popular example is where you access another data source to get the weather in your user's location, which may help predict their purchasing behavior. And finally, a very common set of features used in recommendation systems are time-windowed aggregates, like the number of purchases a user has made in different time periods. As you may imagine, some of these features, like a simple transformation, can be done by the model directly and don't require a feature store; but others, like time-windowed aggregates, need to be pre-computed, stored in a feature store, and made available to the deployed model.

So let's motivate the need for feature stores with a representative walkthrough of how data teams solve these challenges today. We start with raw data sources. The first time features are implemented is usually in an experimental setting: data scientists often take the most expedient path, like copying code from Stack Overflow, to transform the data into the right features. Once data scientists have found a model that works, it needs to be productionized. However, more often than not, the featurization code they wrote during experimentation doesn't quite work in a production setting.
So the features are then re-implemented, sometimes by data engineering teams, to scale to the full data set. If there's only one person working on one model, this may still be okay, but in big enterprises with large data teams there is often another model that needs to be trained that consumes the same features, and because features are typically not easily discoverable and reusable, this whole process is repeated even if the features are almost identical. This is the first big problem that feature stores address: no reuse of features. Once models are finally trained on the full production data set, they're ready to be deployed. However, in order for the client application to make a valid request to the model, it needs to apply the same featurization that was applied at training time, and because that needs to happen within tens of milliseconds, the featurization is usually re-implemented once again, this time by the team that owns the client application. These transformations need to be equivalent; otherwise you get something we call online/offline skew, where the model performs poorly because the online model behaves differently than the model that was trained offline. That is the second problem feature stores address: online/offline skew and the complexity of avoiding it.

So let's see how we can fix this. The first simplification is to implement any given feature only once: the feature store provides a feature registry that keeps track of all features that have been created and facilitates reuse across different models. We found that that alone saves multiple months of productivity in big data teams. Next, the feature store supports two access patterns that provide consistent versions of the features. The first is batch, for high throughput at training and batch inference; model training can happen directly from the feature store, and because we co-designed the feature store with Delta Lake, it inherits all of Delta's benefits — most importantly, the data in the feature store is stored in an open format and can be accessed through native APIs. The same features are then made available at low latency for online serving, which guarantees that the features passed to the model from the client are the same as the features used at training time.

This already removes the need for re-implementing featurization in the client, because the features can be accessed directly from the feature store, and this is how most feature stores work: the client requests features from the feature store and then sends them to the model. But that still leaves complexity in the client — for example, the client still needs to make sure that the versions of the features it requests are compatible with the versions of the model it calls. Because we co-designed the feature store with MLflow, when a model gets trained using the feature store, we store the information about which features it consumes with the model itself. As a result, in serving, the model can handle the full complexity of looking up the right features from the feature store, and the client can be completely oblivious to the fact that the feature store exists in the first place. This is important: features can be updated without any changes to the client, significantly simplifying the deployment process.
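The workflow just described maps roughly onto code like the sketch below. Treat it purely as an illustration: the Feature Store client ships with the Databricks ML runtime, and the class and method names used here (FeatureStoreClient, create_feature_table, FeatureLookup, create_training_set, log_model), the table and column names, and the DataFrames are assumptions that may not match the version you are running, so check the product documentation.

```python
import mlflow
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# 1. Register pre-computed features once (keyed by user_id) so they can be reused.
fs.create_feature_table(
    name="recsys.user_purchase_features",
    keys=["user_id"],
    features_df=purchase_aggregates_df,          # a Spark DataFrame computed upstream
    description="Time-windowed purchase aggregates per user",
)

# 2. Build a training set by joining registered features onto the labeled data.
lookups = [FeatureLookup(table_name="recsys.user_purchase_features",
                         lookup_key="user_id")]
training_set = fs.create_training_set(labels_df, feature_lookups=lookups,
                                      label="purchased")

# 3. Log the model together with its feature lookups, so the served model can
#    fetch the right features itself and the client never touches the feature store.
fs.log_model(model, "model", flavor=mlflow.sklearn,
             training_set=training_set,
             registered_model_name="recsys_ranker")
```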
So, in summary, the Databricks Feature Store solves these two major problems: through a feature registry that facilitates discoverability and reusability of features — and because we integrate with Delta and MLflow, we can provide upstream and downstream lineage — and through a feature provider that gives consistent access to features in batch and online and significantly simplifies the deployment process by packaging up feature lookup with the ML model itself.

So let's see how that works. To add features to the feature store, you simply decorate the function that computes your features and call create feature table. Once features are registered, the feature store UI provides an overview of all feature tables. One thing I'd like to highlight here is automated data source tracking, which logs all data sources that were used in computing a specific feature table. With that information you can search the UI by data source, which enables a key workflow for reusability: when a data scientist starts to work, they can search for all of the feature tables that have already been computed from the same data. On the feature details page you can see the downstream lineage: the feature store records all of the consumers of features and provides a reference to them. This is critical when updating features or deciding whether you can deprecate a feature table. Putting upstream and downstream lineage together, you can go from raw data, to the features that were computed on those data, to the models that were trained using those features, and back.

From there, you can create training data sets directly from the feature store. This API takes care of joining multiple feature tables based on the primary and foreign key relationships stored in the feature registry. And to access features online, all you have to do is publish them to an online store. Note that there is no example here of how to look up features from the client, because you don't have to: the model itself knows how to look up features, so the client stays unchanged.

We are very excited to launch the Databricks Feature Store, and starting today it is available in public preview to all Databricks customers. Some of our customers already had early access and saw great benefits from using the feature store: ABN AMRO is a Dutch bank that uses Databricks to fight financial fraud, and the feature store has been a great help in facilitating reuse of features and accelerating their model development. And with this, I'll hand it over to Kasey, who will walk you through the second new component we're announcing today: AutoML.

Thanks, Clemens. Let's take a splash with AutoML. As Clemens mentioned, we're very excited to announce that Databricks AutoML is now in public preview. AutoML is a powerful tool that automates the machine learning development process, from data set to tuned model. This automation enables data scientists to quickly validate a machine learning project's feasibility, as well as get a baseline model to help guide the project's direction. In addition to solving these challenges, Databricks AutoML also addresses problems we see in current AutoML solutions. In data teams we always find a diverse set of people whose familiarity with AI is on a spectrum. On one end of the spectrum are experts, people with PhDs in AI: they know exactly what they are doing and they want full control — you can think of it as wanting three pedals and a manual gearbox. On the next level are engineers who may not have PhDs in AI; they still want control, but they want the most tedious things taken care of for them, so they want something akin to an automatic transmission. And at the other extreme are what we call citizen data scientists, who expect full automation.
But just like with autonomous cars, most solutions for citizen data scientists just don't quite work yet: with AutoML today, if you reach their limits there's no way to take over control — all you can do is open the door and get out — and that's what we've seen with many of these citizen data science solutions. That's why we're excited to announce Databricks AutoML. Our approach is to augment rather than replace the data scientist, by providing a low-code automated solution where data scientists can take control and drive with their domain expertise at any time. Like most solutions, Databricks AutoML provides a UI as well as an API that takes a data set and performs data pre-processing, feature engineering and selection, and model training and tuning, and ultimately returns a recommended baseline model for deployment. Yet unlike most solutions, we use codegen to let data scientists see exactly what is happening under the hood, and we provide them with the source code for each trial run in separate, modifiable Python notebooks. The transparency of our glass-box approach means that, one, you don't have to spend time reverse-engineering an opaque model in order to make customizations using your domain expertise, and two, if for regulatory reasons you need to show your work, you have the Python notebook you can share to explain exactly how your model was trained. Furthermore, we integrate with all of the Databricks ML features and ecosystem: for example, all AutoML experiments are linked with MLflow, so they track all the parameters, metrics, artifacts, and models associated with every trial run, letting you easily compare your models and easily deploy through batch or online inference. With that, we're looking forward to you all using Databricks AutoML to improve your team's productivity, with reproducible trial notebooks that give you the source code so you can immediately modify these models and use your domain expertise to get a production-ready model. Now let's go see all these Databricks Machine Learning features in action in a demo.

Here we are inside Databricks. To build a model to predict crypto mining for our security team, I first need to make sure I'm using the new Databricks Machine Learning experience. Switching to it changes my left-hand navigation to show machine learning tools and features relevant to my day-to-day workflow. Because I'm unfamiliar with the security team's data set for predicting Dogecoin mining, I want to use the new Databricks AutoML tool to jump-start my development by having it generate a baseline model for me. Databricks AutoML will take a data set, do all the pre-processing, feature engineering, and model training and tuning for me, and provide me with the source code for a baseline model to iterate on.

So, without further ado, let's start an AutoML experiment. On this page, the first thing I need to do is select a cluster. The next thing is to select my problem type: because I'm trying to identify whether or not someone is a crypto miner, this is a classification problem. I next need to find my security team's data set — here it is, in usage_logs — and from here I can see a nice overview of the data set's schema, which is helpful because I'm not really familiar with it. I can see that is_mining is going to be my target, which means I can come down here and select is_mining as the column we're trying to predict.
The last thing I want to do is check the advanced configuration. One of my top goals is to give the security team a quick yes-or-no answer on whether their dataset can be used to predict crypto miners, so I'm going to use the timeout option in the advanced configuration to get a predictable runtime for my AutoML training and make sure I can get back to them within a day. Setting the timeout to 60 minutes means AutoML will return the best model it can within that time frame, so I'll leave it at one hour and click Continue.

With the basic configuration set up, I can now choose to augment my dataset with more features using the new Databricks Feature Store. Feature stores let our team share existing features between different ML projects, so rather than spending time creating new features while I'm on a time crunch, I can use features my teammates have already created. For example, this account features table is managed by my customer success team and has features relevant to a user's account. Because Databricks Machine Learning is a data-native platform built on a lakehouse architecture, it can intelligently suggest features I may want to join to improve my training performance. As I said, this is the customer success team's feature table, but it looks relevant to identifying a crypto miner, so I'll accept the suggestion. I can explore and discover other feature tables in this drop-down, and my eye is caught by the IP features table, which contains features related to the IP address, again extremely relevant to identifying crypto miners, so we'll join on that as well. The only thing I have to do is make sure my primary keys are aligned before we start AutoML. One last thing to note: for this dogecoin project I'm going to be using both the offline and online feature stores. The Databricks Feature Store will ensure consistency between the offline features I'm using right now in training and the same features in the online store when I'm doing real-time inference, so the model behaves as expected without any training/inference skew. And with that, let's start AutoML training.
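For reference, the feature-store joins configured in the UI above could be expressed in code roughly like this. It is a hedged sketch: the table names, lookup keys, and dataset location are loose stand-ins for what the demo shows, and the client calls follow the Feature Store Python API as I understand it.

```python
# Hedged sketch: joining shared feature tables to the training data in code,
# analogous to the UI-driven joins in the demo. Names and keys are illustrative.
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

feature_lookups = [
    FeatureLookup(
        table_name="feature_store.account_features",  # customer success team's table
        lookup_key="account_id",                       # assumed primary key
    ),
    FeatureLookup(
        table_name="feature_store.ip_features",        # IP-address-related features
        lookup_key="ip_address",                       # assumed primary key
    ),
]

training_set = fs.create_training_set(
    df=spark.table("security.usage_logs"),  # raw usage logs containing the label
    feature_lookups=feature_lookups,
    label="is_mining",
)

training_df = training_set.load_df()  # augmented DataFrame used for training
```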
Now that the hour has passed and AutoML has completed, I land on a page that looks like this. Because AutoML integrates with the Databricks Machine Learning experience, MLflow has been used to track all the models, parameters, and metrics associated with every trial run, which makes it extremely easy to compare trials and use my domain expertise to decide which model to use or iterate on for production. The other thing to note is that this table is sorted by the F1 score on the validation set, which is exactly the evaluation metric we chose for AutoML, so the best model is at the top.

Before I get too deep into comparing and evaluating these trial runs, remember that I was pretty unfamiliar with the security team's dataset, so I want to double-check that it was fit for training. An easy way to do this is with the AutoML data exploration notebook, which helps me gut-check my data and get quick insights; much of this code is tedious to write, so data scientists like me may be tempted to skip this step and get to the more exciting part of building a model. Here the pandas profiler gives me some basic summary stats, along with any warnings associated with my dataset; in this case some features have high cardinality, which makes sense for those specific features. I can also see the distributions of my variables. Since I'm doing classification, the main thing I care about is class imbalance, so I scroll down to my target, and sure enough my classes are almost perfectly balanced, so I'm not worried about that. The last thing I check is the correlation matrix. In general it looks good; event time might be correlated with my is_mining target variable, but that makes sense, because we think more and more mining activity is happening as people hope to get to the moon.

Now that I've used the data exploration notebook to gut-check the dataset and I'm confident it's good enough for machine learning, I want to register my baseline model and quickly build an end-to-end proof of concept with real-time inference, to see if I really can predict miners in real time. To do this, I go back to my experiment page and click the Register Best Model button, which lets me create a new model to be tracked and managed inside the Databricks Model Registry; we'll call it crypto detection. The Model Registry is a collaborative hub where teams can share ML models, integrate with governance workflows, and serve their models for real-time inference. On the model version page, I promote this baseline model to the Production stage, which allows me to use real-time serving with it; I add the comment "baseline for proof of concept" and transition it to Production. With the model in Production, I go to the Serving tab on the main model page and click one button to enable serving. Serving creates a REST API endpoint, so I can call this endpoint from any app and get live predictions using this model. Now that the endpoint is live, we can test it directly inside Databricks. One of the nice things about AutoML is that it helps me adopt MLflow best practices, one of which is logging an input example, a sample of your dataset that you can test predictions on. So I have some sample data that I can send as a request, and I get a live prediction on whether or not someone is a crypto miner.
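To make that flow concrete, here is a rough sketch of how the register, promote, and score steps could be done from code rather than the UI buttons. It is a sketch under assumptions: the run ID, workspace URL, token, sample record, and JSON payload orientation are placeholders, and the endpoint path follows the pattern classic Databricks model serving used at the time as far as I recall.

```python
# Hedged sketch of register -> promote -> score from code. Placeholders throughout.
import mlflow
import requests
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the chosen AutoML run's model in the Model Registry.
version = mlflow.register_model("runs:/<best_run_id>/model", "crypto_detection")

# Promote that version so the serving endpoint serves it.
client.transition_model_version_stage(
    name="crypto_detection", version=version.version, stage="Production"
)

# Call the REST endpoint created by enabling serving on the model.
response = requests.post(
    "https://<workspace-url>/model/crypto_detection/Production/invocations",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"columns": ["account_id", "ip_address"],   # pandas-split orientation
          "data": [["acct-123", "10.0.0.8"]]},       # one sample record
)
print(response.json())  # e.g. the is_mining prediction for the sample record
```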
Now, remember how we used the offline feature store to get features for training? Because we're in this data-native environment, the model knows which offline features were used in training, and since that same model is used for online scoring, it automatically grabs those same features from the online feature store at inference time without me writing any code. For proof, look at the logs: when I made this real-time call, the online feature store was used and it fetched features from my account features table and my IP features table. So that's the end-to-end proof of concept built on the baseline model AutoML generated, but I'm certain I can improve on this model using my domain expertise, and this is where the real power of Databricks AutoML comes in.

If I navigate back to my experiment, one of the things you'll notice is the Source column. It contains a reproducible trial notebook for every single trial run, and these notebooks hold the source code that trained that model, so you can see exactly how the data is loaded, how one-hot encoding is done, exactly how the model was trained and which hyperparameters were used, and there's even a SHAP plot showing which features were most important. Now, I have a hunch that crypto miners are probably newer accounts on the platform, especially since we've been seeing more and more mining-related activity, so I'm going to derive a new feature called member recency that calculates how much time has passed between an account being created and the user firing an event on the platform. To do this, I just insert a new cell, write the code that creates the member recency feature, make sure it's added to the pipeline, and I should be good to go. All I need to do now is click Run All; it runs all the training code for me, and I'll have the next version of my model, now incorporating my domain expertise. I didn't have to reverse engineer anything; I could just make direct edits to the source-code notebook to iterate and get a better model.
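The notebook edit itself isn't shown in full on screen, so here is a small, hypothetical version of what the derived member-recency feature could look like; the column names are invented, and `df_loaded` stands in for the DataFrame the generated notebook already loads.

```python
# Hypothetical sketch of the "member recency" feature added in the trial notebook:
# time elapsed between account creation and the logged event. Column names invented.
import pandas as pd

df_loaded["member_recency_days"] = (
    pd.to_datetime(df_loaded["event_time"])
    - pd.to_datetime(df_loaded["account_created_at"])
).dt.total_seconds() / 86400.0  # days between account creation and the event
```

A cell like this, plus adding the new column to the notebook's preprocessing pipeline, is essentially all the demo's edit amounts to.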
All right, now that training has completed, let's check how we did. Coming back up here, our F1 score did improve, we're now at 95 percent, and our log loss seems to have improved as well. Now that we've improved the model with our domain expertise, let's register this version. We can use this handy code snippet, point it at our existing crypto detection model, and run it to create version 2 of the model in the Model Registry. Once that completes, I go over to the Model Registry, click on crypto detection, and see my second version of the model. Because this model performed better, I want it to be the one that's live for real-time inference, so we promote it to the Production stage. Now, when I go to the crypto detection model's Serving tab, version 2 is the one in production, and once the endpoint goes live I can test it. Just like before, we show the input example that was automatically tracked for us with AutoML, and you can see it now includes the derived member recency feature. We send the request to test the live endpoint, and sure enough we're getting better predictions. And to end on that note: the online feature store was again used automatically on our behalf, with the Databricks Feature Store ensuring we don't have any offline/online skew.

That's how we used Databricks Machine Learning to quickly vet my security team's dataset so I could give them an answer on whether machine learning could predict miners, and the AutoML reproducible trial notebooks got me a production-ready model for my security team in less than a day. The Databricks Machine Learning platform let me build a model that predicts crypto miners in real time to help my security team identify suspicious activity. We did this in a data-native way, where all my data was in one location and I didn't have to pull it from various sources; in a collaborative way, where I worked with my security team in Databricks to make sure I had the right domain knowledge to build this model; and across the whole ML lifecycle, from dataset to deployment, all within a single location. That's the real power of Databricks ML. Thank you all for joining us, and with that I'm going to pass it off to Clemens.

Thank you, Casey. I'm very excited about these announcements, but wait, there's one more thing. You may be wondering how you get access to Databricks Machine Learning. We took all of these features, packaged them up nicely, and put a bow on it as a unified platform. Databricks serves all members of a data team, so we're introducing a brand-new persona-based navigation in Databricks, so that each member of a data team has the right tools at their fingertips. Users can switch between Data Science and Engineering and SQL Analytics, and this is also where you will find the new Machine Learning experience; for the multi-talented among you, you can freely switch between options throughout your workflow. When you switch to Machine Learning, we surface the new machine learning capabilities and resources to you, including the new experiments page, the Feature Store UI, and the Model Registry. Some of our customers have had early access to all of these capabilities, and I'm happy to share their experiences so you don't just have to take my word for it. Edmunds, a popular car-shopping website, uses Databricks Machine Learning for, among other things, pricing vehicles on their site. Wildlife, one of the largest mobile gaming companies in the world, uses Databricks Machine Learning to personalize their user experience, which drives their top line. And H&M, the global fashion retailer, uses Databricks Machine Learning to parallelize model training across all of their countries, articles, and time periods. As you can see, companies across all industries can drive meaningful change with this solution, and so can you. So I'm happy to announce that all of the capabilities you saw in this keynote are available to all Databricks customers starting today. Of course, this was a lot of material, so to learn more, please check out our new website at databricks.com. And with that, back to Ali.

Thank you, Matei, Clemens, and Casey. Now it's my great pleasure to welcome Patrick Baginski, senior director of data science, and Abhi Bhatt, director of data analytics, from McDonald's. McDonald's has been investing deeply in data and AI, and I'm very excited to hear how the lakehouse is helping them advance their goals with machine learning.

Thank you, Ali, for having us. Everybody, my name is Patrick Baginski, I'm the senior director of data science for McDonald's, and I'm joined here today by my colleague Abhi Bhatt, director of data and analytics, who is particularly interested in the machine learning operations field. We're going to speak today about our journey in ML operations and data science and some of the choices we've made with regard to technology and architecture. Before I dive into these topics, I want to provide a little overview for those of you who don't know us, which I hope are fairly few. We're in about 119 markets around the world, which means we have over 4,000 users of our internal machine-learning-operations-related solutions.
With those solutions we're informing about 39,000 restaurants and ultimately about 65 million transactions every day, which means our Delta Lake is well over a thousand terabytes of data strong. 2020 has really shown corporations and the economy the effect of societal and environmental impacts, and that's why I believe the QSR industry (QSR stands for quick service restaurant, of which McDonald's is a part) is no longer in the age of transformation; we're now in the age of acceleration. The way McDonald's focuses on accelerating the value from data analytics is through what we call the 3 Ds, our most strategic topics: first, digital, meaning our application, our website, and other digital properties; second, the delivery business, both with partners such as Uber Eats and Seamless and with McDelivery, our own delivery service; and lastly the drive-thru experience, where we're also building data analytics solutions. Based on that, now is really the time for us to rapidly develop data analytics functions. We started with a small team about two years ago of only five to ten people, engineers, data scientists, and analysts, and we have now grown to over 45 people, and we started using open source technologies to deliver value from data science solutions. At McDonald's we're in a hybrid cloud environment, because markets operate slightly differently from each other, so each market can have a different choice of cloud; that's also why we invested in open source and other technologies that make it easier for us to manage, deploy, and build data science solutions. Ultimately, the reason we're doing this is to accelerate the time to value across our operations, and by time to value I mean the time it takes for a data scientist to launch an experiment, prove value from that experiment, and then move it into production, with the help of my colleague Abhi Bhatt, who is leading our ML operations team. Some examples of what we're working on are store forecasting and fine-grained SKU-level forecasting, recommender systems, drive-thru automation, and a lot of typical marketing-related topics such as customer lifetime value, churn reduction, and segmentation, but also supply chain, restaurant operations, and food safety. With that, I'll hand it off to Abhi, who will speak to you in more detail about our ML operations.

Thanks, Patrick. Hello everyone, my name is Abhi Bhatt and I am the global director for data and analytics at McDonald's. I'm excited to talk with you today about the machine learning and Databricks journey at McDonald's. In 2020 we went through a platform and tool selection for our data science and machine learning users, and Databricks was clearly our top choice. We selected Databricks for a few key reasons. One, Databricks has a really easy deployment model: I can take the software, deploy it in my own cloud platform, and keep the data and the infrastructure secured, which lets us run all of our data science workloads and setup while keeping the data and the infrastructure in the platform secured. Second, Databricks as a platform gives end users the flexibility to bring the language of their choice.
They can do their work in Python, Spark or PySpark, or even just basic SQL, so depending on the maturity of the user, the platform can address different languages and make it easy for them to adopt and come to the platform. Third, I truly look at Databricks as an integrated lakehouse platform, with capabilities such as Delta Lake, MLflow, and Databricks SQL, which makes it really easy to build models and scale them. As you can see from the visuals shown, we bring a variety of data into the Databricks platform: our store transactions, our mobile app data for loyalty, customer marketing data, and also some third-party or open source data. Within the platform, the data science folks then build a variety of models, such as customer lifetime value, product recommendation, and demand forecasting. These models get consumed either by a front-end application, which end users can call to invoke the models interactively, or the model output simply gets written into a table and consumed through a report or dashboard in Tableau. With Databricks as a technology and a partner, I truly believe we are on the right track in our MLOps and data science journey.

Let's talk a little more about MLOps, as that's going to be our key focus for 2021 and is a hot topic in the industry right now. I truly believe that building data science models is hard, but scaling and operationalizing them is even harder at a global scale like ours at McDonald's. Here's how we are thinking about it. It all starts with how we bring data into the platform: we land all of the data in an S3 bucket where Delta Lake is enabled. As you may know, Delta Lake is a storage format that helps us do data versioning and build scalable, performant feature engineering pipelines in the platform. Next we look at version control of our code; today we use a combination of Git and Repos, as Repos make it really easy to manage code directly from the Databricks workspace itself. Next, once models are built, the data scientists go through model experimentation, which is done through MLflow experiments. That enables tracking of model parameters and evaluation metrics such as accuracy, recall, precision, and loss, and we log the model artifacts for each run. Once we have selected the final model and are ready for deployment, we use the MLflow Model Registry to manage the lifecycle of the model as it goes from staging to production or to archive, which essentially gets the model ready for serving. For model serving, we think in two patterns: one is on-demand batch scoring, and the second is online inference using SageMaker endpoints. We are also exploring the Databricks serving capabilities that are just coming up, and we're excited to see how they can help us streamline model serving through this platform. Once the model is served, we want to make sure we are catching drift and measuring the quality of the model as well. Here we use a combination of SageMaker's monitoring capabilities, which are fairly generic, for measuring the quality of the model, and custom KPIs that we build with PySpark in the Databricks platform itself.
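As an illustration of the custom-KPI monitoring Abhi mentions, a drift check in PySpark might look roughly like this. It is a sketch, not McDonald's actual code; the table, columns, and path are invented.

```python
# Hedged sketch of a custom model-quality KPI in PySpark: week-over-week error and
# prediction drift computed over a scored Delta table. All names are illustrative.
from pyspark.sql import functions as F

scored = spark.read.format("delta").load("s3://<bucket>/delta/scored_forecasts")

weekly = (
    scored.withColumn("week", F.date_trunc("week", F.col("scored_at")))
    .groupBy("week")
    .agg(
        F.avg("prediction").alias("avg_prediction"),
        F.avg(F.abs(F.col("prediction") - F.col("actual"))).alias("mae"),
    )
    .orderBy("week")
)

weekly.show()  # a sudden jump in avg_prediction or MAE is a cue to investigate or retrain
```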
Overall, I truly believe this is a best-in-class MLOps solution, put in place using Databricks, MLflow, Delta Lake, and even AWS SageMaker. Next, let's see what the numbers tell us. As we talked about, we started our MLOps journey in 2020, but within less than nine months we have gone from zero to production-scale MLOps and scaling of the models. In those nine months we not only identified the tool and the technology and did the legal paperwork, which can always be a hassle, but also identified use cases, built the models, and deployed them. We've enabled more than 15 use cases, with 30-plus models deployed in five-plus countries or markets in which we operate. Multiple ML deployment frameworks have been built for the data science and end users to ensure we have the right security in place for building and deploying models, we built CI/CD pipelines especially for deploying the models, and we will continue to enhance these frameworks. From a platform usage perspective, we have used about 130,000 DBUs (Databricks Units), and on the compute side we run about 27,000 compute instances a month in AWS. While these numbers may look impressive, they are not for a scale like McDonald's; we anticipate that in 2021 and beyond they will grow at least four to five times, and that will be when we are truly doing MLOps at scale at McDonald's.

Next, let me share a few findings from our data science and machine learning journey. The first point I'd like to hit on is that this is truly a journey of continuous improvement and learning: there will always be new data sources, new models or enhancements to put in place, or new skill sets that are needed, so always think of it as a continuous improvement journey. Second, identify the right technology and toolset, one that gives your end users the flexibility to bring a custom model of their choice or even do basic SQL-like work in the platform; that's a big part of what led us to go with Databricks. Third, if you are building a new team or want to retain your team, it takes time to find and grow new resources, so make sure you have a good hiring plan, and if you're a consultant-friendly company, ensure your team is supplemented by consultants. And when you look at the skill set of your team, it's not just the data science and data engineering folks; ML engineering is very critical as you scale your models and think about MLOps, so don't forget about those resources and skill sets. Last but not least, platform capabilities and technology are changing really rapidly, so being agile is critical, not just in your thinking but also in the architecture and solutions you have in place, so that you can stay ahead of the curve and provide the best solutions and platforms to your teams. So, folks, we've talked about why we selected Databricks, how we are thinking about MLOps, and shared our findings. As I mentioned before, this is truly a journey and the journey continues for us. For this next part I will hand it over to Patrick, who will take us through where we go next and what that part of the journey looks like for us.
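And as a concrete illustration of the on-demand batch-scoring pattern described a moment ago, loading a registered model by stage and scoring a Spark DataFrame could look roughly like this; the model, paths, and column names are invented for illustration.

```python
# Hedged sketch of on-demand batch scoring: load the Production model from the
# registry as a Spark UDF and score a feature table. Names are illustrative.
import mlflow.pyfunc
from pyspark.sql import functions as F

predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/demand_forecaster/Production")

features = spark.read.format("delta").load("s3://<bucket>/delta/store_features")
feature_cols = [c for c in features.columns if c != "store_id"]

scored = features.withColumn("forecast", predict(*[F.col(c) for c in feature_cols]))
scored.write.format("delta").mode("overwrite").save("s3://<bucket>/delta/scored_forecasts")
```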
So, thank you. Thank you, Abhi, for the deep dive on machine learning operations. Our journey truly does continue, as Abhi mentioned. I hate to use buzzwords, but one thing we're truly trying to do is continue to democratize machine learning and data science practices across the company, using largely out-of-the-box capabilities and builds. As a consequence, we also want to continue educating and providing learning opportunities to our teams, to the markets, and to the different people that are part of the McDonald's system. We want to enable more end-to-end automation in machine learning operations in general, and I spend about an hour to two hours every week talking to startups and hearing new ideas to find out what the best path forward is, particularly when it comes to machine learning operations. We want to continue to implement governance and cost-control measures to make sure that what we're doing continues to make sense from a business perspective. And last but not least, very importantly, we want to continue to learn and actually have fun along the way too. So what does that really mean for us at McDonald's? We have a bit of a value-acceleration roadmap now that we have the Databricks platform as a tool in our box; these are just three items I'll mention here. We're going to do some very fine-grained SKU-level forecasting for our restaurants. We're going to continue to automate marketing- and personalization-related activities, going beyond just good machine learning for marketing by using unsupervised and potentially even reinforcement learning techniques to further automate the end-to-end interaction with customers through offers. And lastly, if you're doing this, you also want to get cross-channel measurement and cross-channel performance measurement right, with the ability to measure across markets, restaurants, and marketing efforts how we are performing with our marketing offers. Before I wrap up, thank you everybody for joining the talk. I do want to mention briefly that our teams are still hiring at McDonald's, so you should be able to find open job postings via my LinkedIn profile and the McDonald's careers page. With that, I'll hand it back over to you, Ali.

Thank you, Abhi and Patrick, really fascinating work. Next we have a great talk from Rajat Monga, the co-creator of TensorFlow, on how AI is eating all of software.

Hello everyone, I'm Rajat Monga, here to talk a bit about AI. This here is a deep neural network, something you've probably heard a lot about, especially if you're here today. As you can see, it has all these different layers: it takes inputs, each layer applies a function to them, the signal goes all the way through, the network makes a prediction, and it learns from that. This is really one of the core algorithms behind what we see as the success of AI over the last decade. In addition to these algorithms, a few other things have come together to make this possible. Here's one example: many of you may have heard of ImageNet, a dataset with millions of images across lots of different classes that models have been trained on to really improve the state of the art of many of the vision models we have today. So that's sort of number two, putting the
data together in fact uh data's sort of become so important here that you'll see things like this if you are uh you know hopefully you're not stopping a car and getting the stop sign wrong while they're actually running their stuff but data is really you know making a big difference in how far or how fast ai is making a progress in so the third piece of these once you have the algorithms and the data together is how fast can you train these models and all the data set that we are creating here and that's where the computational power comes in you know what you see here is a tpu part it's actually a generation older than where we are today uh but this is already a hundred plus pair of flops with lots and lots of memory and you can see a pretty amazing super computer it would probably be just based on this one of the fastest super computers in the world at this time so uh you know that's the third thing this is the computational power that we've seen coming together and in fact as these three things come together that's really uh you know changed the name of the game with ai and being helped with tools like tensorflow and i was uh you know glad to be a part of the team that you know built this and drove this uh for a while so it was really exciting to see the change or the impact that this has been making now let's go back a bit over the last decade to see what what's uh happened during this time if you got a squint on this you'll see something that looks like a cat and the reason is we learned from a lot of youtube videos and of course who hasn't seen a youtube video without a camera with a cat so lots of good stuff uh scaled on tons of machines 16 000 cpu cores back in 2011 and that allowed us our network to really learn to recognize cats automatically without telling it what a cat was what a dog was and so on so that was pretty cool about this one now as we go along over around the same time in 2012 as deep learning was taking off folks over in toronto some students along with professor hinton basically came together and tried okay can you run this new kind of not really new kind of model but these kind of models can you scale them up and run them on really large data sets like imagenet before that people were really programming and coming up with very custom features to build things like this so when they did this they showed a huge improvement on anything before and this really started to change the game on how people were thinking about deep learning at this time over the next year or so at google we were applying this in lots of different areas building new kinds of models this is what came to be known as google net uh again applying similar kind of ideas with convolutional ads scaling across lots of different machines and what this led to what all these improvements in different things really led to improvements in products themselves so if uh you know any of you have heard about picasa which was almost like a precursor to google photos a while back that that was a piece of software where you could really organize your pictures and you could manage them and you could label them yourself you could put you know saying oh these are pictures from my uh last family vacation and so on so you could organize things like that what google photos did was you know what all that organization that you have to do manually is now gone you can just get things because you can search for them it understands what the images are it understands what the pictures that you're taking are and that's really 
you know changed the game in terms of uh all kinds of photo processing or image processing things that we were doing then of course when you're searching for pictures why type it out when you can speak it out and so that that was another area where we're seeing lots of advances around the same time uh in this case again using the same technology deep neural networks uh that's been the underpinning of a lot of this we saw where can moving from the old style uh methods which were also you know somewhat learned but somewhat programmed to these uh fully trained networks allowed us to do like huge improvements you know more than 20 error rate improvements over what was possible in the past and what that changed was uh from being a research topic or being used in like very very specific places speech recognition became mainstream where we can now you know talk to our phones and do all kinds of things you know you probably have a google assistant or an alexa or cd somewhere that we've used before of course that was step one they've since then they've been you know lots more improvements the next one was this uh new kind of architecture lstms that was applied to speech around that time and that that included uh that led to another 10 improvement on top of that and we've continued to see those huge advances since then now uh of course when you think about speech the you know part of the speech recognition is just to convert okay what you're saying to text but you also need to understand that text and that's where uh you know over the last several years we've seen a lot more advances in natural language processing as well uh started sort of with the the bert style of models and transformers where uh you know this research led to again huge improvements in what could be done with natural language and again in this case the data and the computer was as important the data set there was a q a data set that came out of stanford and people started to optimize things and really learned from that that led to some of these and for folks if you those of you who've been sort of involved in this area or have looked at this area of course more recently there's gpd3 that has shown even more advances as well of course let's see uh where people are using this what can we do with this and so one great example is google search itself you may be familiar with the page rank kind of algorithm you know that's one of the first algorithms that really started google that that's what google was based on when it started out since then there have been lots of changes and additions uh a lot of them manual a lot of them you know just programmed in terms of combining different kinds of rules together to uh get better results in search itself as you can see you know that this example really shows how just an advancement in terms of you know applying an algorithm like this like work deploying that in search went from just matching the basic keywords which which was okay but it wasn't really getting the context of what the user wanted and the one on the right it really shows how it can once it gets the context it can really answer those so much much better of course these are not the only areas you know we often think about all of these running in deep learning and other algorithms being very very expensive so they run in huge servers on the data center that's often true but but with recent advances it's been possible to really bring them down and run them on really small devices including your phones so in fact you know very likely 
whatever phone you have if you've bought it in the last few years it's running some kind of deep learning models in fact this example here shows a model being applied to really get a better picture after you've taken it and some of this can be done with your camera that's integrated on like pixel phones for example where it blurs the background combining not just your lenses but also a whole bunch of smartness in terms of figuring out what's the foreground from the background as well and you know on the phone itself there's given that now we can get things in that small form factor there's so much more that can be done for example here there are places where you can't always hear the audio or maybe you have trouble hearing audio and being able to live caption things like that automatically just on your device without worrying about anything else just makes a huge difference uh so thinking about you know there's a lot of consumer stuff that we've talked about and phones of course to show you the right things the right applications that you want to use companies like google you do a lot of machine learning behind the scenes to recommend the right pieces and there's a lot of stuff that goes on here as well in this case there was a recommender system that really took information about the user all the different things that could be recommended and decided okay what should be recommended to this user when it comes on he or she comes on next and in this case we saw you know what this is great there's a lot of stuff that goes on to to put all those pieces together and you know this was deployed it was doing great but then when people started looking at what's happening they realized uh you know one of the big differences was what was being run or trained on in the data center was different from what was eventually being run and executed to make the recommendations just fixing that one thing led to a huge improvement like two percent improvement in this particular case and so you know as you think about how all these pieces come together and they're made possible there's a lot of different pieces that really need to be done for uh making this possible there's so many different things to make this model work not just you know get all the compute and value them together and so they're there you know these are some of the pieces that you see right here in fact there are tools you know around tensorflow that have been built to really solve for this uh that whole ecosystem called tensorflow extended with all these different pieces that all the way from uh validating data ingesting it processing it building the models deploying production and we need to help folks take care of that and that's really to help uh you know bring ai into more and more things across the board so you know we're clearly seeing lots of advances in this and we're seeing lots of applications of that as well of course not all is great with ai and so there are a number of issues with ai as well and wanted to just talk about a couple just to make sure we're mindful of it when we think about ai and and do things at our end so so one for example you know with uh you know as i say with great power comes great responsibility there's a lot of power in terms of you know say identifying faces and doing recognition and so on that can be used in all kinds of ways not always in a good way if you think about cameras all over the world uh most cities today have cameras deployed in different places uh whether it's uh in the eastern side or or in 
the west there are so many different cameras that can be used in different kinds of ways to recognize people and do different kinds of things we need to be really responsible and thoughtful in how we want to use them as well entering an era in which our enemies can make it look like anyone is saying anything at any point in time even if they would never say those things so for instance they could have me say things like i don't know killmonger was right or ben carson is in the sunken place or how about this simply president trump is a total and complete now you see i would never say these things at least not in a public address but someone else would someone like jordan peel this is a dangerous time moving forward we need to be more vigilant with what we trust from the internet that's a time when we need to rely on trusted news sources it may sound basic but how we move forward and the age of information is going to be the difference between whether we survive or whether we become some kind of up dystopia thank you stay woke uh so as you saw here uh there's this opportunity to change things or how they look and over the last several years we've been talking about fake news in lots of different contexts and the ability to really make things look so real has never been possible before now this just makes that problem harder and we have to be really careful about how we use this technology in good ways as well so talking about you know issues those are some of the issues that ai is bringing but in broadly you know stepping back there are lots of different challenges that humanity faces as a whole today i'll talk about one of them you know energy if you think about how much energy we use you know here's a number here a million kilojoules one way to think about it is let's think of an energy required to boil a single kettle of water now think of 8 billion people boiling that kettle of water every single minute round the clock all the time that's really how much energy we are using every single day and today that's really coming from lots of fossil fuels really you know hurting us in lots of ways you know of course with climate change and the heating heat that's causing uh you know all not all is not bad in that this is one area where perhaps ai can help there are lots of you know renewable areas renewable energy sources that we are working on solar and wind being one and from simple things like just looking at okay how can we repair these windmills how can we find the turbines and the problems with them sooner to forecasting the solar and the wind power sooner and much better allows us to make a lot more progress and allows us to use these more effectively and efficiently towards a much better future and so really what i want to end with is ai is really making a difference and uh you know it has a huge possibility for impact thank you thank you so much so exciting to see what's happening with tensorflow i hope you enjoyed and learned a lot from today's talks i'm really excited about the integration of koalas into spark to make data science scale much easier the adoption of ml flow is really impressive it's great to see the work that the community keeps putting into the project great new features in databricks ml with automl and feature store it is now my pleasure to introduce our next guest speaker bill nye is popularly known as bill nye the science guy he's an american mechanical engineer science communicator and television presenter following the success of his show nye continued to advocate 
for science becoming the ceo of the planetary society and helping develop sundials for the mars exploration rover missions please join me in welcoming bill nye to data and ai summit greetings greetings bill nye here hello to everybody at the data and ai summit uh it's great to see you all again wait i can't see you but you can see me and uh that's good because we're using millions and millions i would say certainly 10 million but probably closer to a billion transistors to engage in transistor transistor logic and have this fabulous interaction so thank you all for including me now for those unfamiliar with this kind of picture this is a picture of the earth and it's where i've spent my whole life down there someplace and i just like to remind people it's not out of focus that's the earth's atmosphere and the atmosphere is extraordinarily thin in many pictures from space the atmosphere is so thin how thin is it you can't even see it looks as the old saying goes it's like a coating of varnish on a globe but i spent a lot of time in the atmosphere and uh breathing it and carrying on and my first job out of engineering school was uh at boeing and i worked on 747 airplanes don't worry everybody if you've ever been on a 747 i was very well supervised it's a fantastic plane uh you're you're very safe on a 747. all right with that said when i was there there was a a guy who was still around who was a test pilot on the very first 707 in those days it wasn't even called the 707 yet that was a designation that came along later that was the dash 80 this one is when they first painted this picture when they first painted it and you don't have to know too much about airplanes can you see uh two there's two features about it that are cool do you see how long how wide the wing is at the root as it's called of the wing in those days that was all that structure was needed to hold the nacelles the engines and then you see all those veins those are on there to create uh to promote the turbulent boundary layer to make the air flowing over the wing do this the molecules tumble and it actually reduces drag in exactly the same way you reduce drag by putting dimples and golf balls uh so i spent a lot of time there and there was a guy who was still around who was the test pilot whose name was tex johnston and uh texas johnson like many guys of that era he had been a pilot in world war ii he affected that this at this u.s accent that chuck yeager the famous guy here in the states who broke the sound barrier in a rocket plane the x1 and uh tex johnson was a test pilot on this airliner and he did a barrel roll with the airliner he took his giant plane and rolled it over twice and there's this famous picture taken by the flight engineer this is back in the day when uh an airplane like that would have three guys three people in the cockpit well in those days they were all men uh in the cockpit a pilot co-pilot what we now call first office uh uh captain and first officer and then a flight engineer and the flight engineer took this picture this is the guy that would watch the fuel gauges and make sure they weren't messing up but it's upside down so tex johnston lands and this is before everybody before mobile phones at every press conference and video of everything that ever happened you know this is a different era but by all accounts boeing management says to text johnston bro i paraphrase but you know text what are you doing this is this is our prototype this is the only you know be careful with this thing he said 
i'm selling airplanes and then he's supposed to have said one test is worth a thousand expert opinions and those are words to live by my friends one test is worth a thousand expert opinions you can't beat it that you can claim you know what's going to happen but to really find out what's going to happen is uh of tremendous value and let me say i am honored to be invited to speak at the data and ai summit because you guys and gals are the future you all are going to change the world you all are going to respect the facts and make the world better than it is today and so i really appreciate you including me now i know many of you are not in the u.s and may not be familiar with my work but i'm a mechanical engineer i had i worked on this this gizmo to suppress vibrations in the yolk the steering wheel of the 747 that was that was cool it's the kind of thing they give to a young guy all right you when you're fresh out of school you can still do this this uh arcane sort of off in the corner math and uh so that was that was quite tremendous uh loyalty to boeing uh for a long time but when i was in college people were just beginning to talk about our university people were just beginning to talk about climate change and climate change you could say these days was discovered on venus the planet venus when james hansen who's still around and testified in front of the u.s congress in 1988 about climate change pointed out that the venus is kept warm by carbon dioxide in the air in the venusian air and uh it's happening here on earth because humans are putting extra carbon monoxide in the air not just in the air but putting it into the air fast extremely fast and that's the problem and then when i was in school this famous astronomer carl sagan and another guy jim pollock had done this analysis that if you set off all the nuclear weapons in the world at the same time you'd create this huge dust cloud that would shade the earth for weeks and weeks or months and months and then walter and louis alvarez found uh the crater that almost certainly it was was a result of the impact which created a dust cloud so big how big was it bigger than the earth's diameter that it finished off the ancient dinosaurs they may have been go the ancient dinosaurs may have been having trouble with other atmospheric problems but that impact is what did them in anyway these three things came together for me in the 1980s and then in the 1990s and i got very concerned about climate change i got very concerned about science education i got very concerned about engineering and so you guys i'm so old how old are you i'm so old when i was in school we had punch cards when i worked at sunstrand data control after boeing on black boxes which really are black to radiate heat we had fortran seven anyway uh a lot has changed in the last few years but the principles of data management and computing are the same just that you all are able to have the machine learn from its mistakes in a way that we just could not do so my awareness of climate change has been with me for decades and just recently uh entities that concern themselves with these problems national internets of space administration here in the u.s the intergovernmental panel on climate change at the united nations and elsewhere have realized that this year is almost or it's tied for the hottest year on record the hottest year on record was 2016 which was a hot year combined with an el nino and el nino is this warming of the pacific ocean that is its effects are well known but 
its cause is still a mystery anyway in the el nino year uh it was the hottest year on rec 2020 was almost exactly the same uh hot hotness overall warmth but without an el nino so the world's getting warmer and warmer faster and faster that's where you all come in you are going to dare i say it change the world and uh you're going to do it by doing more with less so i uh for those of you around the world who may or may not have heard of this thing i claim there are three things you want to do for everybody on earth you want to provide clean water renewably produced reliable electricity and access to the internet for everybody in the world and in order to do that when people talk about uh clean energy we're talking about electricity you know electricity is magical you can do amazing things when you have access to electricity and so you can make toast for your breakfast or you can or you can have a conference like this electronically engaging people all over the world and so along this line there's an organization here in the states and that works with people around the world called the solutions project and they have done an analysis and look you guys i'm not saying this analysis is perfect but it's something to think about they have done an analysis of starting with the united states that you could provide clean energy to everybody in the u.s right now if you just decided to do it you would have geothermal energy where possible wind energy solar energy not shut down any existing nuclear power plants but don't try to build anymore because it takes too long to get them done and you use a lot of concrete and there's no place nobody wants the waste even if you had a great idea of what to do with the waste nobody wants it and they have continued the analysis or they they say that we'd get about almost three million jobs just to run all this new infrastructure and you know infrastructure is a big word right now in the u.s but they claim you could do it around the world now let's say they're off by a factor of two or three this would be a huge start if we had a hundred percent clean energy in around the world we would solve an enormous number of problems for a great many people and everybody the whole idea that this can't be done is to me is is misguided it just looks like it can be done from an engineering standpoint geothermal wind solar new better battery storage and perhaps in the next 15 years even or let's say 20 years we will have uh something like fusion energy and of course it will come to the developed world first there will be problems with it but in the next 40 or 50 years it's very reasonable for a change that that could be doable if that's possible we could dare i say it change the world and so along this line there's something that's very important that's even a bigger idea than clean water renewable electricity access to the internet and that is raising the standard of living of women and girls when you raise the standard of living of women and girls everybody's life is better everybody's life uh is improved uh when you have twice as many educated people in the world you solve problems in new ways study after study has shown that the more diverse the team you have trying to solve a problem the faster you get to a more satisfactory or better answer so uh you guys maybe you heard that that's a truck outside the window because we're doing this electronically because we couldn't get together in uh northern california in the u.s this year there it goes carry on you guys anyway uh we 
were able to do this because you all have created this you all have created this remarkable technology now raising the standard of living of women and girls is not an easy task if you know where to look the family myth is that my great-grandmother is in this picture this is in the united states in washington d.c i grew up in the city of washington d.c in the city limits and the family myth is that my great-grandmother sally was a suffragist marching in this parade the same week i guess three days or two days after her first grandchild had been born and you know and a lot of people a lot of grandmothers get very engaged with grandchildren but no no she was out marching in the suffragist parade and that was you guys well over a hundred years ago you know that family myth was passed down to me but i can tell you this for reals as an eyewitness my mother marched in this in the equal rights amendment parade here in the united states where women wanted the rights to vote and by friends of course these are citizens of country and so they should have the right to vote and just for those of you into the details the equal rights amendment was brought up especially in 1973 is when it really got going and there's a rule here law here in the united states you've got if there's 50 states 38 of them or two-thirds of them have to agree to it well the commonwealth of virginia has agreed or ratified the equal rights amendment and uh it's number 38 but there's all these issues of has it expired has it been there too long before enough states ratified it anyway all that aside the world is changing whether or not this equal rights amendment here in the world's uh influential technologically nominally sophisticated society the world's third most populous nation whether or not that happens this year or next year it will certainly happen in the coming decades for me in my part i would like to happen this weekend but it it'll happen sooner or later the world is changing and that's exciting the world's getting warmer and extraordinary extraordinarily fast women are being empowered in a way they have not been in millennia and we are all gonna have to get used to changing and that is your business that's what you are into the future is open you all build software and now uh combinations of software and machines that learn from their experiences electronically automatically softwareically and so you all are the future and i hope i very much hope you will take this responsibility very seriously so that you can change the world now speaking of which uh the same guy who wrote co-wrote this computer program in 1980 carl sagan through a remarkable series of happenstances was my college professor for one class and he started the planetary society 1980 the world's we are the world's largest non-governmental space interest organization advancing space science and exploration so that people everywhere know the cosmos and our place within it then the elevator door closes no that's you the planetary society through another extraordinary set of happenstances not only did i take one course from this guy and join as a charter member in 1980 now i'm the ceo of the planetary society and we work to get governments and people around the world to support space science and exploration especially the exploration of planets so you guys we are living at this extraordinary time the perseverance rover just landed on mars and it's got two microphones on it and i've been in meetings you all where i won't say with whom i was in these meetings 
where people oh we don't need to put microphones on mars no we know the composition of the martian atmosphere we know exactly what it would sound like you don't even need to do that that's a waste of time waste of weight on the spacecraft and you know it's a waste of bandwidth for a telemetry sending information back to you don't need to do that okay we put those microphones on there and last week the ingenuity rotorcraft this helicopter the tip the tips of the blades go at about 0.6 mach 747 can get up easily up to that speed but on this extra in this extraordinarily thin atmosphere this other world and these uh scientists saw the dust move on the surface of mars in a completely unexpected way and they heard the sounds for the first time and i claimed the planetary site was instrumental pun intended on putting those instruments on mars we got a microphone on mars in 1999 mars polar lander but it became mars polar crasher when these retro rockets didn't fire with a software problem the landing gear extended the software thought if i can use that expression thought it hit the surface but no it was in mid-space and just barreled in anyway carl sagan and the planetary society have changed my life when you look at this image it is a way of addressing these two questions we all ask we all ask these questions where did we all come from and are we alone in the cosmos if you want to answer those questions you have to explore space and you have to explore mars so those layers of the kodiak mesa which is the tentative name for this feature on mars are sedimentary layers under a cap rock that's how a mesa forms this place used to be soaking wet there used to be a gigantic body of water flowing through here are there fossilized microbes in those layers what on earth we call stromatolites and in the background there you guys that's not like dust in the atmosphere that's the wall of the jezeru crater jezero crater this enormous feature on mars that may have signs of life because it was a delta of some ancient body or of water or river system on mars and you all will very well very likely be alive when this discovery is verified or set aside this hypothesis is embraced or set aside if we found evidence of life on another world it would change this one it would change the way every one of us feels about being a living thing in the cosmos and i claim there's a great lesson to be learned from this the exploration of planets is a metaphor or a way of thinking about exploration in general if you stop looking up and out what does that say about you whatever it is is not good so that's why we advocate for the exploration of other planets not just here in the united states we are an international organization 100 members in 130 countries around the world check us out carl sagan in class also told us this story back in the disco era johann kepler was looking at the night sky and he observed what the comet that nowadays we refer to as comet holly you may have used the expression halley's comet that's fine but they found people from the holley family they prefer to say holly and now the astronomical tradition is to say comet blank comet holly kepler was looking at this thing and he realized there's something about sunlight that was creating these tails there's the dust tail and the ion tail these charged particles that stream behind the comet as the dirty snowball as it's called gets volatized by sunlight so he speculated in 1607 400 years ago that he said people will sail the cosmos the way we sail on the ocean 
and indeed Carl Sagan presented this idea to this very famous talk show host here in the United States, Johnny Carson, back in September of 1976: we'll build a spacecraft so big and shiny (how big and shiny would it be?) that sunlight would just push it through space, without any mass exchange. In other words, it's not the solar wind, it's not particles, that will push this thing; it's photons, pure energy, no mass. And this has been speculated about since Kepler's time, and verified in the 1920s, really, when relativity was discovered and the quantum was investigated. And he, he and his colleagues, proposed a spacecraft to catch up with Comet Halley, to be launched around 1980 and catch it in 1986. Well, it didn't get built. The International Space Station and the U.S. Space Shuttle program superseded it, and the Ariane, you know, the European rockets, and Japanese rockets, they were built, in a sense, in exchange for this mission. All right.

The Planetary Society pursued this for years. We built the Cosmos 1 solar sail; it ended up in the Barents Sea, which, if you're savvy about global ocean-going navigation, is part of the Arctic Ocean. It just crashed. All right. Well, five years ago the Planetary Society managed to build LightSail 1. I was on a very famous talk show here in the United States, The Late Show with Stephen Colbert, a guy who, decades later, essentially succeeded Johnny Carson, talking about our solar sail mission, which is absolutely derived from Carl Sagan and his colleagues' mission, and from Johannes Kepler. And we got it to work, you guys. LightSail 2 is flying right now. You can check us out at planetary.org and see where it is. And we proved that, if you all are the Sun and the microphone is the Earth, we could come at you like this: twist in space, get a push, twist, get a push, and build orbital energy, even in the gravity environment near the Earth, where making this maneuver is hard because of the inertia of this large spacecraft. For those of you who remember anything about your fundamental physics (I know you're all software guys, but it's still true): as we say, things happen for a reason, and that reason is usually physics. All right.

We built the spacecraft to prove this for space agencies around the world, and for private and public space explorers, space-exploring corporations, because this technology might enable us to take cargo to Mars for free: there's no rocket fuel, that kind of thing. So that was the intent. And along with that, we built these cameras in the spacecraft that are very high quality; they're made by a company here in the States called The Aerospace Corporation. And it took our engineers, men and women who are here in California mostly, but also at Georgia Tech (Georgia is a state in the U.S.), and some people who work, now and then when they have time, at a community college in Hawaii. They have messed with the software over the last two years, and they're able to get these extraordinary pictures. This was the launch. It was a Falcon Heavy, a SpaceX Falcon Heavy, and, you know, that company is very pridefully led by this famous guy here in the States, Elon Musk. They used boosters that had already been flown; they strapped them onto this thing and had three of these giant boosters, 27 engines. Your chest shakes, your hair shakes. Oh man, it was just amazing. And so we got the thing in orbit, and the software people, over the last two years, have been able to send down these
fantastic pictures. And this is what astronauts, astronauts of all nationalities, refer to as the overview effect: when you see the Earth from space, it changes the way you feel about the Earth, and it changes the way you feel about living on the Earth. And, you guys, what's happened over the last millennia, the last hundred thousand years, is that instead of humans being controlled by nature, or what you might think of as everything not human, now we humans are in charge of nature. We are running the show. This is the Nile River and a bit of the Suez Canal from our spacecraft. It changes the way you feel about being a living thing. Are we alone? Where did we come from? Well, now we're running the Earth; we're in charge, and that's where you all are going to change the world. The key to the future is not to do less; it's not to not drive, not eat, not wash your clothes. The key to the future is doing more with less, and that's where software and machine learning and artificial intelligence, to me, are going to change the world.

Now, here's a famous picture taken by the Cassini spacecraft. Cassini was the astronomer and mathematician who predicted, or showed, where the gaps in the rings of Saturn would be, and he was right. This picture was taken a few years ago, and it's amazing. Saturn is gorgeous, the sunlight coming from the south pole of Saturn, and what we've learned about the weather on Saturn has affected, or influenced, our analysis of climate change here on Earth. And speaking of the Earth: this is not just a beautiful picture of Saturn, it's a picture of the Earth. For those of you unfamiliar with this, the Earth's right here. That's it. That's the whole thing. That's everybody. If we fly up here, this way, about a hundred thousand kilometers, there's the same picture; Saturn would now be below us if we were in a spacecraft taking this picture. Do you see the Earth? It's right here. That's the Earth. That's everyone you've ever met. That's every fungus, every tree, every sea jelly. Carl Sagan was very eloquent about this too. You know, every emperor, everyone who tried to control a corner of this dot in space, they all lived here; they've all come and gone here. There's nobody coming to save us. If we mess up the environment to the point where many of us cannot have a quality of life that's suitable for carrying on our generations of people, it's our fault. This is it. This is where we make our stand. This is the Earth. This is our home. This is where artificial intelligence, and the future being open to new ideas, this is where you all are going to, dare I say it, change the world. Thank you all very much for including me in this conference, you guys. Now, guys and gals, get out there and change the world for the better: clean water, renewable electricity, access to the internet, to raise the standard of living of women and girls as we run the show here on Earth, and make it better for everyone. Thank you.

So this concludes our second-day morning keynotes. Thank you all for tuning in. We have really exciting keynotes and sessions in the afternoon. On behalf of all of Databricks, thank you for tuning in. [Music]
Info
Channel: Databricks
Views: 5,298
Rating: 5 out of 5
Keywords: Databricks
Id: VLLTuMBARss
Length: 137min 37sec (8257 seconds)
Published: Thu May 27 2021