Web AI: On-device machine learning models and tools for your next project

Video Statistics and Information

Captions
Hello everyone, I'm Jason Mayes, Web AI lead here at Google, and I'm joined by my colleague Na Li, our on-device machine learning tooling lead, who'll be talking in just a few minutes. Oh, my clicker. Now, I want to start by formally defining what Web AI actually is: it's the art of using machine learning models client side in the web browser, running on your own device's CPU or GPU, by using JavaScript and surrounding web technologies like WebAssembly and WebGPU for acceleration. This is different from cloud AI, whereby the model would execute on the server side and be accessed by some kind of API.

As you may have guessed, a lot has changed in Web AI since 2023, so I want to start by highlighting key areas you'll learn about in today's talk, as we'll be giving updates from a whole bunch of Google teams working in this space. You'll go from learning how to run our brand new large language models in the browser at incredible speeds, with no server-side calls after the page load, to seeing the impact of running models client side to make your company even more cost effective when creating real business applications like video conferencing. Or what about going from idea to prototype faster with our new collaboration with Hugging Face, to a taste of the future where you can talk to Visual Blocks, our low-code framework, and have it build a pipeline for you in seconds using our latest research publication? And we've even got updates from the Chrome team on how they're enabling JavaScript developers to leverage Web AI at Chrome scale using technologies like WebGPU and WebAssembly, and even new AI-focused APIs at the browser level. So with that, let's get going.

First off, what difference does one year actually make? Well, we're pleased to announce that we crossed 1 billion cumulative downloads of MediaPipe and TensorFlow.js libraries and models for the first time. In fact, over the last two years we averaged 600 million downloads per year, bringing us to over 1.2 billion downloads in that time frame, and we're on track in 2024 to continue that growth in usage. Focusing in on the TensorFlow.js library alone, you can see the steady rise since 2020 in this graph from npm, where developers really started investing in Web AI. Note that this graph shows weekly downloads, so we're currently seeing around 1.1 million monthly downloads just via npm. Shifting to the next chart, this shows the number of content delivery network downloads from jsDelivr just for the month of January 2024: we had over 11.8 million downloads of the library in that time, and again we expect usage to increase over the coming year as Web AI is explored by more developers than ever before in production use cases.

Now, what does this mean for business? Well, in addition to a better user experience due to reduced latency, when you bring an AI model into the web browser you also gain privacy for the end user, along with significant cost savings too. Let's take video conferencing as an example. Many video conferencing providers offer background blur or background removal in video calls for privacy, so let's crunch some hypothetical numbers for the value of using client-side AI in an application like this. First up, a webcam typically produces video at 30 frames per second, so assuming the average meeting is 30 minutes in length, that's 54,000 frames you've got to blur the background for. Assuming 1 million meetings per day for a popular service, that's basically 54 billion segmentations every single day. Now, even if we assume a really ultra-low cost of just $0.0001 per segmentation, that would still be $5.4 million per day, which is around $2 billion a year for server-side GPU compute costs. By performing background blurring on the client side via Web AI, that cost goes away.
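To make those numbers easy to check or adapt to your own service, here is the same back-of-envelope calculation as a minimal JavaScript sketch; the per-segmentation price is purely an illustrative assumption, not a quoted cloud price.

```js
// Back-of-envelope numbers from the talk; the per-segmentation price is an
// assumption for illustration only.
const fps = 30;                                  // webcam frame rate
const meetingSeconds = 30 * 60;                  // 30-minute meeting
const framesPerMeeting = fps * meetingSeconds;   // 54,000 frames to segment
const meetingsPerDay = 1_000_000;
const segmentationsPerDay = framesPerMeeting * meetingsPerDay;  // 54 billion
const costPerSegmentation = 0.0001;              // assumed USD per server-side inference
const costPerDay = segmentationsPerDay * costPerSegmentation;   // ~$5.4 million
const costPerYear = costPerDay * 365;            // ~$2 billion
console.log({ framesPerMeeting, segmentationsPerDay, costPerDay, costPerYear });
```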
And don't forget, you can even port other models to the browser too, such as background noise removal, to improve the meeting experience for your users while still getting all of those benefits.

Now, speaking of production use cases, 2024 has been quite the year for bringing generative AI to the browser by the wider JavaScript community, and we at Google have some new offerings too. First up, I want to speak about Gemma Web. This is a new open model that can run in the browser on a user's device, built from the same research and technology that we used to create Gemini. By bringing a large language model to run on device, it can save significant costs compared to running inference on a cloud server, along with enhancing privacy and reducing latency. In the example shown here, you can see how the user is able to generate an email to their friend for some given context and certain requirements, without any of the text being sent to the server. Even better, this runs really, really fast: the demo you see here is captured in real time. Now, you could imagine turning something like this into a Chrome extension, whereby you can highlight any text in the web page, right click, and convert it to some form that you can post on social media, or maybe look up a word that you don't understand, all in just a few clicks, for anything you come across, instead of having to go to a third-party website to do that. In fact, I did exactly that right here in this demo that I made in just a few hours on the weekend, entirely in JavaScript, client side in the browser. There are so many creative ideas waiting to be made here, from Chrome extensions that supercharge your productivity to features within the web app itself. We're at the start of a new era that can really enhance your web experience, and the time is now for all of you to start exploring those ideas.

Right now, generative AI in the browser is in its early stages, but as hardware continues to get better, with more CPU and GPU RAM becoming commonplace, we'll continue to see models like this ported to run on device in the browser, enabling businesses to reimagine what they can do on a web page, especially for industry- or task-specific situations where the weights of smaller LLMs in the range of 2 to 8 billion parameters can be tuned for a specific purpose on consumer hardware.

Which brings me to my next update: our brand new large language model inference API. This API supports four leading open architectures out of the box, accelerated by both CPU and GPU. All of these are easy to use via a common API to load and run right in the browser, on device, without any server calls after the page load, by any front-end developer, and at speeds that are well beyond the average human reading speed. So let's learn a bit more about each of these. First up we've got Gemma 2B. This is a lightweight, state-of-the-art open model that's well suited for a variety of text generation tasks, including question answering, summarization, and even reasoning. We recommend using Gemma 2B; it's available to download on Kaggle Models and comes in a format that's compatible with our LLM Inference API.
If you take one of the other supported architectures on the following slides, you need to convert it to a runtime format that can be used in our API, using our converter library, which I'll talk about in just a few slides' time. Next up is Phi-2. This is a 2.7 billion parameter transformer model best suited for question-and-answer situations, chat, and code formatting. We've also got Falcon RW 1B, which is a 1 billion parameter model trained on 350 billion tokens using the RefinedWeb dataset. And finally we've got StableLM 3B, which is a 3 billion parameter decoder-only language model pre-trained on one trillion tokens of diverse English and code datasets.

So with all that in mind, what's the performance like? As you can see from the table, depending on the client-side device you're actually running on and which architecture you choose to use, you get different runtime memory usages and token generation speeds. It should be noted that a token is estimated to represent around 0.75 words, so that means our fastest tested setup reached around 64 words per second, and our lowest-scoring setup is about 10 words per second. Given that the average human reading speed is around four words per second, even the lowest result is still over two times faster than the average person could read, so that's probably good enough for most situations, and with time this is only going to get better as hardware improves. In fact, one could envision a hybrid approach right now, whereby if a client machine is powerful enough, you download and run the model there, only falling back to cloud AI when the device is not able to run the model, which might be the case for older devices where the CPU or GPU RAM is quite small. With time, more and more compute can be performed on device, so your return on investment should get better as time goes by when implementing an approach like this.

OK, so how hard is it to use? Well, it's actually pretty straightforward; in fact, it fits on a single slide. Let's quickly walk through it. First, you import the MediaPipe LLM Inference API using standard JavaScript import statements. Next, you define your large language model, wherever it's hosted on the internet; you would have downloaded that from one of the previous links on the slides. Once you've downloaded and hosted that model and set the correct CORS headers, you can then use it in your web application. Now, you define a new asynchronous function that will actually load and use the model, and inside this you can specify the fileset URL that defines the desired MediaPipe runtime to use. This is the default one that MediaPipe provides and hosts for you, and this is safe for you to use as well; however, if you really wanted to, you could save this file on your own CDN or server and host it there too. Next, you use the fileset URL from the prior line to initialize MediaPipe's fileset resolver, which actually downloads the runtime for the generative AI task you're about to perform. Now you can load the model by calling the LLM task's createFromModelPath method, to which you pass the fileset and the model URL that you defined above. As the model is a large file, you must await for it to finish loading, after which it returns the loaded model, which you assign to a variable called llm there on the left-hand side. Now that you've got the model loaded, you can use it to generate text just by giving some input text as a parameter, and you can store the text result in a variable called answer on the left-hand side there. With that, you can log the answer, display it on screen, or do something with the knowledge that comes back. Note: if you want to stream results instead of waiting until the very end, you can simply pass a function as a second parameter there, which will stream partial results as they become available, and you can inject those into your web page as they arrive to get that nice streaming effect you see on all the online web chat applications. And that's pretty much it. Now just call your init LLM function to kick off the loading process above and wait for the results to be printed.
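For reference, here is roughly what that single slide amounts to as a minimal sketch, assuming the @mediapipe/tasks-genai npm package. The model URL and output element below are placeholders, and while the slide uses a createFromModelPath helper that takes the fileset and model URL directly, this sketch uses the equivalent createFromOptions form; check the linked documentation for the exact, current API surface.

```js
// A minimal sketch, assuming the @mediapipe/tasks-genai package; the model URL
// and output element are placeholders you'd replace with your own.
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

// Gemma 2B weights downloaded from Kaggle Models and hosted with CORS enabled.
const MODEL_URL = 'https://your-cdn.example.com/gemma-2b-it-gpu-int4.bin';

async function initLlm() {
  // Resolve the MediaPipe GenAI runtime (WASM files) from the default CDN,
  // or point this at a copy hosted on your own server if you prefer.
  const genaiFileset = await FilesetResolver.forGenAiTasks(
      'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');

  // The model is a large file, so await the load before using it.
  const llm = await LlmInference.createFromOptions(genaiFileset, {
    baseOptions: { modelAssetPath: MODEL_URL },
  });

  // One-shot generation: pass input text, get the full response back.
  const answer = await llm.generateResponse(
      'Write a short email inviting a friend to lunch on Saturday.');
  console.log(answer);

  // Streaming: pass a callback as the second argument to receive partial
  // results as they become available.
  llm.generateResponse('Summarize Web AI in one sentence.', (partial, done) => {
    document.querySelector('#output').textContent += partial;
  });
}

initLlm();
```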
So given that you now know how to load and run these models, we're also pleased to announce that by using any of these four architectures, you'll be able to load custom tuned weights too, not just the pre-made ones that we've made for you. That means you could distill or fine-tune your own versions of these models for one of these target architectures, convert that to the client-side model format, and be instantly able to run your own custom-trained model right in the browser, at comparable speeds to what we just saw, as long as your weights fit one of those architectures and are of the same size. To do that, follow the link shown to learn all about it.

So, as you've seen, the LLM Inference API lets you run large language models completely in the browser, on the user's device, for your next web application. You could use an LLM to perform a wide range of tasks that were previously just not possible in JavaScript alone, such as generating text, answering questions about a document that's being viewed, or even rewriting some text on the web page into a form you can better understand. And even better, you can do it at great speeds too. So check the link on this slide to go deeper into that API and try things for yourself; we're looking forward to seeing what you all create. On that note, if you do make something cool or manage to benchmark some of these LLMs on your own devices, we'd love to see your results, so use the community WebAI hashtag on social so we can share knowledge as a community. This also gives you a chance to be featured at our future events.

Next up, I want to hand over to Na Li to talk about updates to the Visual Blocks framework that we launched last year, along with the collaboration with Hugging Face that we've both been working on. Thank you.

Thank you, Jason. Hi everyone. Last year we launched Visual Blocks, a no-code machine learning prototyping tool that enables developers and decision makers to work together when using machine learning. This allows users to focus on the problem they're actually trying to solve, instead of being blocked by code complexity and technical barriers. All key features are neatly packaged in a node graph editor, as shown. Out of the box, users can select from a suite of pre-made nodes to perform common, useful tasks like getting data from a webcam or microphone, or visualizing the outputs of an AI model.
When you drag out from one of these nodes, it can suggest valid things it's able to connect to. In this manner, you can quickly create an end-to-end prototype that you can share with your wider team, enabling anyone, anywhere, to try what you've made on their own machine with their own data and input devices, or even customize the flow as they need to explore other related ideas.

This year we're pleased to announce a collaboration with Hugging Face, who have created 16 brand new custom nodes for Visual Blocks, bringing the power of Transformers.js and the wider Hugging Face ecosystem to the Visual Blocks framework, which you can now all use too. Eight of these new nodes run entirely client side via Web AI. Let's walk through the Hugging Face collection to see the superpowers you can get out of the box that can help bring your ideas to life.

First up is image segmentation. As you can see, you can pass the model an image and then click a part of the rendered image to reveal just the pixels that belong to the object you clicked on. Previously, Visual Blocks shipped with a person segmentation model, but here Hugging Face have extended this ability further: you can click multiple areas on the image to combine object segmentations and view the results in real time. So depending on what you want to segment, you can choose the most suitable model; for example, for portraits of a person the face parsing model may be a good fit, but for clothing the SegFormer B2 Clothes model may fare better. Try them all out today with the link shown.

Next, translation, a brand new node we did not have before. Here you can take any piece of text, pick a language of your choice from the node's drop-down box, and have your input text converted to the desired language. There are five variations of this model to choose from depending on your requirements, with the smallest being 78 megabytes. Now, you can imagine using this with other nodes to bring powerful ideas to life. Imagine you also have a node that can extract text from images; in that case, you can feed the text found in the image into this translation node to convert what you see around you in the real world into something you can understand when you're on holiday or abroad, just like Google Lens does, but in the web browser. There's a lot of potential to get really creative here, especially when combining with other Visual Blocks nodes.

Next up is token classification. What's that? Well, given some sentence, it can extract words that are in some way meaningful, such as locations, companies, or names of people found in the sentence, as shown on this slide. Having the ability to extract useful information from a long sentence could help you perform a more powerful search or understand your users' intent in greater detail. Again, you can choose from several models depending on your needs.

Moving on, you also have the hello world of machine learning: image classification and object detection. You can select from four new classification models and two new object detection models, including ResNet and YOLO variants. It should be noted that many of these models were trained on the ImageNet-1K dataset, which did not contain people in the training data, so while these models may not perform well on images of people, they're pretty good at finding animals and other objects, as shown in our example images.

Switching back to text models, we also now have a new text classification node. This allows you to classify text based on sentiment or toxicity, for example.
Right now you can choose from the three models provided, depending on your needs, with the smallest being 67 megabytes in size.

Next up, background removal. This model loves to remove the background from an image. Some of you may be wondering how this compares to our existing body segmentation model. Well, the cool thing about this one is that it doesn't just focus on people, so as you can see here, when I remove the background with an animal in the foreground, it still works just fine. Pretty neat. So give it a try today using the link shown on the slide.

Finally, we have depth estimation. For any given image, the model will try to estimate how far away each pixel is from the viewpoint. For subtle movements like you see here, this can help give the illusion of a 3D image using any regular 2D image. You can adjust the displacement amount using the slider in the viewer node to get an effect that works well for your specific image.

In addition to the client-side models on the previous slides, Hugging Face also support several task-based nodes that execute a model of your choice via a server-side call using their own APIs. This means thousands of models that fall under one of these supported task types can now all be used within Visual Blocks too. So what are the supported task types? You can choose from fill-mask, image classification, summarization, text classification, text generation, text-to-image, or even token classification. As this talk focuses on client-side models, we encourage you to try those out in your own time, as they could complement Web AI models in a hybrid manner; with time, I'm sure we'll see client-side variants being produced too, as devices continue to get more powerful. So head on over to the project page at the link shown, go.gf-vblocks, to learn how you can use the new Visual Blocks nodes from Hugging Face today.
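The client-side Hugging Face nodes above are powered by Transformers.js, and you can call the same library directly outside Visual Blocks. Here is a minimal sketch, assuming the @xenova/transformers package; the default model for the task is downloaded and cached on first use, and other tasks such as translation or depth estimation follow the same pattern.

```js
// A minimal sketch, assuming the @xenova/transformers package (Transformers.js).
// The default model for the task is fetched and cached on first use.
import { pipeline } from '@xenova/transformers';

const classify = await pipeline('text-classification');  // sentiment by default
const result = await classify('Web AI makes this so much easier!');
console.log(result);  // e.g. [{ label: 'POSITIVE', score: ... }]
```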
So how did Hugging Face make those custom nodes work seamlessly with their own custom code and APIs? Well, today we are pleased to announce custom nodes for Visual Blocks. Since launching Visual Blocks in early 2023, we've spoken to many of you throughout the year, and time and time again we saw requests from people wanting to run custom logic. On that note, you wanted to be able to make custom nodes that could work with all our existing offerings, so you didn't have to start from a blank canvas, especially for common reusable things such as accessing sensors like the webcam, or common output visualizations for vision or text models. So, hello custom nodes. Built-in nodes may not be a perfect match for all use cases, but this is where custom nodes can shine. Even better, they're just regular JavaScript web components, specifically the custom elements implementation, so it's really easy to make new nodes using your favorite frameworks, or even no frameworks at all, as web components are part of the web standards in modern browsers.

At this point you may be wondering what a custom node can do. The short answer is that if you can write it in JavaScript, you can turn it into one. Maybe you have some custom client-side logic that can be turned into a custom node, or maybe you want to call a new third-party web API on some remote server; that can be a custom node too. And with that, the nodes you create can work with all our existing ones, and even other people's creations, assuming the input and output types match, allowing you to innovate faster by reusing the work of others. So come along and join our workshop this year to learn how to make your very own custom nodes from a blank canvas. For those of you here live, search for the Visual Blocks workshop going on later today, and for those tuning in online, you can search for the recording or head to the codelab link shown to go at your own pace. We look forward to seeing what you all create; if you make a custom node and want a chance to be featured at our future talks, please tag a demo of it in action using the Visual Blocks hashtag on social.
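Since custom nodes are ordinary custom elements, the shape of one will be familiar to any web developer. Here is an illustrative sketch of just the web component part; the actual Visual Blocks node registration API is covered in the workshop and codelab, and the property and event names below are hypothetical.

```js
// An illustrative custom element only; the Visual Blocks node registration API
// itself is not shown, and the property and event names here are hypothetical.
class ShoutTextNode extends HTMLElement {
  // Upstream nodes would feed their output into this input property.
  set inputText(value) {
    const result = String(value).toUpperCase();  // the node's custom logic
    this.textContent = result;                   // simple built-in visualization
    // Emit the result so downstream nodes (or the host app) can consume it.
    this.dispatchEvent(new CustomEvent('outputs', { detail: { result } }));
  }
}
customElements.define('shout-text-node', ShoutTextNode);
```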
Now then, if you can create custom nodes that have well-defined inputs and outputs, what could the future of Visual Blocks look like as AI itself progresses? Today we would like to share with you a research project called InstructPipe, a collaboration between many research folk here at Google, to give you a taste of that future. Today, with Visual Blocks, to go from idea to working prototype you drag out and connect nodes to solve some task you have in mind, and this is much faster than coding each block yourself manually, which is great. But what if we could build a copilot for visual programming to go one level higher still? Imagine you could type a sentence describing what you wanted, and the Visual Blocks graph would be made for you automatically, powered by the latest in generative AI. Let's take a look at our research project in action. Here you can see the user simply types a prompt into the box and the Visual Blocks pipeline is produced. Notice how it was not perfect the first time, so we modified the prompt to specify using a transparent image instead, and then we got the desired effect. Or what about this trip planner, made with a quick prompt and a few user inputs to set it up? Once the graph was created, most of the pipeline is complete, and the user just enters their use case along with a valid API key at run time to get the results for the city they desire. Using this approach, we were able to achieve around an 81% reduction in user interactions, freeing up the end user to focus on the task they're actually trying to solve instead of connecting wires to blocks. Here you can see one more example of this in action. In this case, we asked to turn the image of a tiger into a cat, but to do that we used the PaLM API to describe the existing image first and then used that description to prompt Google's Imagen image generation model, which produced the output we desired: an image with a cat posed in a similar manner. Really incredible stuff, and this is just scratching the surface of what could be possible in the future as multimodal models continue to get better with time. So if you were excited by what you just saw, learn more about this research via the link shown to read our full paper.

Next up: Model Explorer. If you spend a lot of time building, testing, and deploying ML models, you know how important it is to understand what's happening under the hood: for example, how nodes are related, how they are structured, and how they are performing. Google machine learning engineers face this problem every day, so we built a tool we call Model Explorer to make model debugging more intuitive and easier. Now we are excited to share Model Explorer with the world so that the entire ML community can benefit. Model Explorer supports multiple model formats, including JAX, TensorFlow, TensorFlow Lite, and TensorFlow.js files. Many teams in Google use Model Explorer in their daily work, such as Gemini, Chrome, and YouTube. So let's see it in action. Here I'm demonstrating a generative AI diffusion model that can run in the browser. You can see that even navigating very large models is smooth, uncluttered, and low latency. We can navigate the graph layer by layer; clicking on a layer highlights similar ones, saving you time and clicks, and the properties panel gives you detailed information about each layer and node. We designed Model Explorer with usability in mind, so here are a few examples showing how you could leverage its features to enhance your debugging experience. First of all, we provide a bookmark feature that allows you to easily jump back and forth between different areas of the graph. Another notable feature is a color palette that allows you to annotate nodes with colors of your choice, letting you easily find the similar nodes you care about. Model Explorer allows you to easily traverse, analyze, and debug machine learning models of almost any size and complexity, and it's now free for everyone to use. To get started, just visit the link shown on the slide. OK, now I will hand back to Jason to cover what Chrome has been up to. [Applause]

Thank you, Na. So next up, we've collaborated with Chrome, who have been investing in Web AI as well this year. I highly recommend you check out their I/O talks for all the details, but I want to give them a shout out for some highlights. Already in this talk you've seen Web AI in action where a model is loaded and run within the web page itself, like our Gemma large language model. But what if the model was already there for any site to use, built into the browser? That way you wouldn't need to download your own LLM, and instead it could be used across domains via a standardized JavaScript API. So imagine this: what if you could get the model to do what you need, like summarizing a large chunk of text on a web page or a blog post, or making something technical easy to understand, without having to master machine learning model creation skills?
What superpowers could you all get as developers? Well, the web is so much bigger than our team, so we've been speaking to all of you about AI and its challenges to find out what's on your minds, to help shape the future of AI in the browser. You can check the summary of our findings via the link shown on this slide, and we welcome further feedback if you've got ideas too, including thoughts on Chrome's built-in AI approach to solving the key challenges we've heard you've been facing; use the WebAI hashtag on social to tell us your thoughts. On that note, we're also working on a new website providing guidance specifically for web developers choosing to use AI. With this site we aim to help you understand key AI concepts so you can discover opportunities to use popular models, be more productive than ever, and use generative AI to build delightful user experiences with existing tools, models, and APIs. Bookmark the site shown on this slide, as we'll continue to publish more content there over the year.

All right, going deeper down the stack, Chrome also has updates for WebGPU. This year we can now support 16-bit floating point values for GPUs that support it. But why is that important? Well, as you can see from the screenshot, using 32 bits to store model weights for this very large language model resulted in around 11 tokens per second for decoding data, but by using 16 bits this increased 45% to 16 tokens per second, and it uses half as much memory to store those weights.
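In WebGPU terms, this maps onto the optional 'shader-f16' feature. Here is a minimal sketch of checking for it and opting in, which you would pair with f16 storage in your WGSL shaders; it assumes a browser and GPU that expose the feature.

```js
// A minimal sketch of opting into 16-bit floats in WebGPU where available.
const adapter = await navigator.gpu.requestAdapter();
const hasF16 = adapter.features.has('shader-f16');

// Request the feature so WGSL shaders can declare `enable f16;` and store
// weights as f16 arrays, halving memory use for those buffers.
const device = await adapter.requestDevice({
  requiredFeatures: hasF16 ? ['shader-f16'] : [],
});

console.log(hasF16 ? 'shader-f16 supported' : 'falling back to f32');
```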
Next, updates for WebAssembly. I know a number of folk in the Web AI community have had issues creating web apps when you try to load models larger than 4 gigabytes in size. With the new Memory64 proposal, there is now exploration into supporting 64-bit memory indexes, which would allow you to load larger AI models than you could before; check the link for more details. Also, support for JSPI is now available to trial, which enables better interoperability between WebAssembly and other JavaScript APIs like WebGPU. This essentially bridges the gap between synchronous applications and asynchronous web APIs, which I know some of you will be happy to hear. You can keep track of this proposal and others via the link shown. Finally in this section, Chrome is enabling its translation and speech recognition APIs to work entirely offline, allowing you to go offline with your own web apps and still have these advanced features powering your user experience. It's great to see increasingly advanced features able to run on device, and I believe this is a trend we'll continue to see throughout the year and into the future, especially as models shrink in size for various tasks. So with that, do check out Chrome's talks on the web track to learn more about what they've been up to.

Finally, I'd like to give a shout out to something I know you all love to deal with: testing your AI models. Doing this on the server side is very well documented and fairly straightforward, but what if you want to test a client-side model in a real browser environment to see if it performs well using technologies like WebGPU and WebAssembly? Well, I'm pleased to announce I made a solution you can all try today that allows you to do just that, for any web application that needs WebGL or WebGPU support. Here you can see the difference in performance between the CPU on the left-hand side, which is what you get by default when running headless Chrome, versus the GPU on the right-hand side, when Chrome correctly uses the server-side GPU. As you can see, this can seriously increase your testing speed and allows you to verify that even larger generative AI models work correctly in these browser environments before you actually make them public for the world to use. So if you need to test web models in a more scalable and reproducible way, head on over to our blog post write-up to learn more, or grab the code yourself from GitHub to help you on your testing journey.

And with that, that's a wrap for this year. We really do look forward to seeing what you create with Web AI. 2024 is shaping up to be a great year for progress in the field, from generative AI running locally to performance improvements across the board. And remember, do tag us in anything you create, and if you've got any suggestions for any of the technologies you've seen in today's talk, feel free to connect with us on social media; just search our names. With that, thank you, and see you next time. [Applause]
Info
Channel: Google for Developers
Views: 18,965
Keywords: Google, developers, pr_pr: Google I/O;, ct:Event - Technical Session;, ct:Stack - AI;, ct:Stack - Web;
Id: PJm8WNajZtw
Length: 33min 30sec (2010 seconds)
Published: Thu May 16 2024