Web AI: On-device machine learning models and tools for your next project

Video Statistics and Information

Captions
Hello everyone, I'm Jason Mayes, Web AI lead here at Google, and I'm joined by my colleague Na Li, our on-device machine learning tooling lead, who'll be talking in just a few minutes. Oh, my clicker. Now, I want to start by formally defining what Web AI actually is: it's the art of using machine learning models client side in the web browser, running on your own device's CPU or GPU, by using JavaScript and surrounding web technologies like WebAssembly and WebGPU for acceleration. This is different from cloud AI, whereby the model would execute on the server side and be accessed by some kind of API.

As you may have guessed, a lot has changed in Web AI since 2023, so I want to start by highlighting key areas you'll learn about in today's talk, as we'll be giving updates from a whole bunch of Google teams working in this space. You'll go from learning how to run our brand new large language models in the browser at incredible speeds, with no server-side calls after the page load, to seeing the impact of running models client side to make your company even more cost effective when creating real business applications like video conferencing. Or what about going from idea to prototype faster with our new collaboration with Hugging Face, to a taste of the future where you can talk to Visual Blocks, our low-code framework, and have it build a pipeline for you in seconds using our latest research publication? And we've even got updates from the Chrome team on how they're enabling JavaScript developers to leverage Web AI at Chrome scale using technologies like WebGPU and WebAssembly, and even new AI-focused APIs at the browser level. So with that, let's get going.

First off, what difference does one year actually make? Well, we're pleased to announce that we crossed 1 billion cumulative downloads of MediaPipe and TensorFlow.js libraries and models for the first time. In fact, over the last two years we averaged 600 million downloads per year, bringing us to over 1.2 billion downloads in that time frame, and we're on track in 2024 to continue that growth in usage. Focusing in on the TensorFlow.js library alone, you can see the steady rise since 2020 in this graph from npm, where developers really started investing in Web AI. Note that this graph shows weekly downloads, so we're currently seeing around 1.1 million monthly downloads just via npm. Shifting to the next chart, this shows the number of content delivery network downloads from jsDelivr just for the month of January 2024: we had over 11.8 million downloads of the library in that time, and again we expect usage to increase over the coming year as Web AI is explored by more developers than ever before in production use cases.

Now, what does this mean for business? Well, in addition to a better user experience due to reduced latency, when you bring an AI model into the web browser you also gain privacy for the end user, along with significant cost savings too. Let's take video conferencing as an example. Many video conferencing providers offer background blur or background removal in video calls for privacy, so let's crunch some hypothetical numbers for the value of using client-side AI in an application like this. First up, a webcam typically produces video at 30 frames per second, so assuming the average meeting is 30 minutes in length, that's 54,000 frames you've got to blur the background for. Assuming 1 million meetings per day for a popular service, that's basically 54 billion segmentations every single day. Now, even if we assume a really ultra-low cost of just $0.0001 per segmentation, that would still be $5.4 million per day, which is around $2 billion a year for server-side GPU compute costs. By performing background blurring on the client side via Web AI, that cost goes away.
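To make those numbers easy to check or adapt to your own service, here is the same back-of-envelope calculation as a minimal JavaScript sketch; the per-segmentation price is purely an illustrative assumption, not a quoted cloud price.

```js
// Back-of-envelope numbers from the talk; the per-segmentation price is an
// assumption for illustration only.
const fps = 30;                                  // webcam frame rate
const meetingSeconds = 30 * 60;                  // 30-minute meeting
const framesPerMeeting = fps * meetingSeconds;   // 54,000 frames to segment
const meetingsPerDay = 1_000_000;
const segmentationsPerDay = framesPerMeeting * meetingsPerDay;  // 54 billion
const costPerSegmentation = 0.0001;              // assumed USD per server-side inference
const costPerDay = segmentationsPerDay * costPerSegmentation;   // ~$5.4 million
const costPerYear = costPerDay * 365;            // ~$2 billion
console.log({ framesPerMeeting, segmentationsPerDay, costPerDay, costPerYear });
```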
And don't forget, you can even port other models to the browser too, such as background noise removal, to improve the meeting experience for your users while still getting all of those benefits.

Now, speaking of production use cases, 2024 has been quite the year for bringing generative AI to the browser by the wider JavaScript community, and we at Google have some new offerings too. First up, I want to speak about Gemma Web. This is a new open model that can run in the browser on a user's device, built from the same research and technology that we used to create Gemini. By bringing a large language model to run on device, it can save significant costs compared to running inference on a cloud server, along with enhancing privacy and reducing latency. In the example shown here, you can see how the user is able to generate an email to their friend for some given context and certain requirements, without any of the text being sent to the server. Even better, this runs really, really fast: the demo you see here is captured in real time. Now, you could imagine turning something like this into a Chrome extension, whereby you can highlight any text in the web page, right click, and convert it to some form that you can post on social media, or maybe look up a word that you don't understand, all in just a few clicks, for anything you come across, instead of having to go to a third-party website to do that. In fact, I did exactly that right here in this demo that I made in just a few hours on the weekend, entirely in JavaScript, client side in the browser. There are so many creative ideas waiting to be made here, from Chrome extensions that supercharge your productivity to features within the web app itself. We're at the start of a new era that can really enhance your web experience, and the time is now for all of you to start exploring those ideas.

Right now, generative AI in the browser is in its early stages, but as hardware continues to get better, with more CPU and GPU RAM becoming commonplace, we'll continue to see models like this ported to run on device in the browser, enabling businesses to reimagine what they can do on a web page, especially for industry- or task-specific situations where the weights of smaller LLMs in the range of 2 to 8 billion parameters can be tuned for a specific purpose on consumer hardware.

Which brings me to my next update: our brand new large language model inference API. This API supports four leading open architectures out of the box, accelerated by both CPU and GPU. All of these are easy to use via a common API to load and run right in the browser, on device, without any server calls after the page load, by any front-end developer, and at speeds that are well beyond the average human reading speed. So let's learn a bit more about each of these. First up we've got Gemma 2B. This is a lightweight, state-of-the-art open model that's well suited for a variety of text generation tasks, including question answering, summarization, and even reasoning. We recommend using Gemma 2B; it's available to download on Kaggle Models and comes in a format that's compatible with our LLM Inference API.
If you take one of the other supported architectures on the following slides, you need to convert it to a runtime format that can be used in our API, using our converter library, which I'll talk about in just a few slides' time. Next up is Phi-2. This is a 2.7 billion parameter transformer model best suited for question-and-answer situations, chat, and code formatting. We've also got Falcon RW 1B, which is a 1 billion parameter model trained on 350 billion tokens using the RefinedWeb dataset. And finally we've got StableLM 3B, which is a 3 billion parameter decoder-only language model pre-trained on one trillion tokens of diverse English and code datasets.

So with all that in mind, what's the performance like? As you can see from the table, depending on the client-side device you're actually running on and which architecture you choose to use, you get different runtime memory usages and token generation speeds. It should be noted that a token is estimated to represent around 0.75 words, so that means our fastest tested setup reached around 64 words per second, and our lowest-scoring setup is about 10 words per second. Given that the average human reading speed is around four words per second, even the lowest result is still over two times faster than the average person could read, so that's probably good enough for most situations, and with time this is only going to get better as hardware improves. In fact, one could envision a hybrid approach right now, whereby if a client machine is powerful enough, you download and run the model there, only falling back to cloud AI when the device is not able to run the model, which might be the case for older devices where the CPU or GPU RAM is quite small. With time, more and more compute can be performed on device, so your return on investment should get better as time goes by when implementing an approach like this.

OK, so how hard is it to use? Well, it's actually pretty straightforward; in fact, it fits on a single slide. Let's quickly walk through it. First, you import the MediaPipe LLM Inference API using standard JavaScript import statements. Next, you define your large language model, wherever it's hosted on the internet; you would have downloaded that from one of the previous links on the slides. Once you've downloaded and hosted that model and set the correct CORS headers, you can then use it in your web application. Now, you define a new asynchronous function that will actually load and use the model, and inside this you can specify the fileset URL that defines the desired MediaPipe runtime to use. This is the default one that MediaPipe provides and hosts for you, and this is safe for you to use as well; however, if you really wanted to, you could save this file on your own CDN or server and host it there too. Next, you use the fileset URL from the prior line to initialize MediaPipe's fileset resolver, which actually downloads the runtime for the generative AI task you're about to perform. Now you can load the model by calling the LLM task's createFromModelPath method, to which you pass the fileset and the model URL that you defined above. As the model is a large file, you must await for it to finish loading, after which it returns the loaded model, which you assign to a variable called llm there on the left-hand side. Now that you've got the model loaded, you can use it to generate text just by giving some input text as a parameter, and you can store the text result in a variable called answer on the left-hand side there. With that, you can log the answer, display it on screen, or do something with the knowledge that comes back. Note: if you want to stream results instead of waiting until the very end, you can simply pass a function as a second parameter there, which will stream partial results as they become available, and you can inject those into your web page as they arrive to get that nice streaming effect you see on all the online web chat applications. And that's pretty much it. Now just call your init LLM function to kick off the loading process above and wait for the results to be printed.
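For reference, here is roughly what that single slide amounts to as a minimal sketch, assuming the @mediapipe/tasks-genai npm package. The model URL and output element below are placeholders, and while the slide uses a createFromModelPath helper that takes the fileset and model URL directly, this sketch uses the equivalent createFromOptions form; check the linked documentation for the exact, current API surface.

```js
// A minimal sketch, assuming the @mediapipe/tasks-genai package; the model URL
// and output element are placeholders you'd replace with your own.
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

// Gemma 2B weights downloaded from Kaggle Models and hosted with CORS enabled.
const MODEL_URL = 'https://your-cdn.example.com/gemma-2b-it-gpu-int4.bin';

async function initLlm() {
  // Resolve the MediaPipe GenAI runtime (WASM files) from the default CDN,
  // or point this at a copy hosted on your own server if you prefer.
  const genaiFileset = await FilesetResolver.forGenAiTasks(
      'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');

  // The model is a large file, so await the load before using it.
  const llm = await LlmInference.createFromOptions(genaiFileset, {
    baseOptions: { modelAssetPath: MODEL_URL },
  });

  // One-shot generation: pass input text, get the full response back.
  const answer = await llm.generateResponse(
      'Write a short email inviting a friend to lunch on Saturday.');
  console.log(answer);

  // Streaming: pass a callback as the second argument to receive partial
  // results as they become available.
  llm.generateResponse('Summarize Web AI in one sentence.', (partial, done) => {
    document.querySelector('#output').textContent += partial;
  });
}

initLlm();
```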
So given that you now know how to load and run these models, we're also pleased to announce that by using any of these four architectures, you'll be able to load custom tuned weights too, not just the pre-made ones that we've made for you. That means you could distill or fine-tune your own versions of these models for one of these target architectures, convert that to the client-side model format, and be instantly able to run your own custom-trained model right in the browser, at comparable speeds to what we just saw, as long as your weights fit one of those architectures and are of the same size. To do that, follow the link shown to learn all about it.

So, as you've seen, the LLM Inference API lets you run large language models completely in the browser, on the user's device, for your next web application. You could use an LLM to perform a wide range of tasks that were previously just not possible in JavaScript alone, such as generating text, answering questions about a document that's being viewed, or even rewriting some text on the web page into a form you can better understand. And even better, you can do it at great speeds too. So check the link on this slide to go deeper into that API and try things for yourself; we're looking forward to seeing what you all create. On that note, if you do make something cool or manage to benchmark some of these LLMs on your own devices, we'd love to see your results, so use the community WebAI hashtag on social so we can share knowledge as a community. This also gives you a chance to be featured at our future events.

Next up, I want to hand over to Na Li to talk about updates to the Visual Blocks framework that we launched last year, along with the collaboration with Hugging Face that we've both been working on. Thank you.

Thank you, Jason. Hi everyone. Last year we launched Visual Blocks, a no-code machine learning prototyping tool that enables developers and decision makers to work together when using machine learning. This allows users to focus on the problem they're actually trying to solve, instead of being blocked by code complexity and technical barriers. All key features are neatly packaged in a node graph editor, as shown. Out of the box, users can select from a suite of pre-made nodes to perform common, useful tasks like getting data from a webcam or microphone, or visualizing the outputs of an AI model.
When you drag out from one of these nodes, it can suggest valid things it's able to connect to. In this manner, you can quickly create an end-to-end prototype that you can share with your wider team, enabling anyone, anywhere, to try what you've made on their own machine with their own data and input devices, or even customize the flow as they need to explore other related ideas.

This year we're pleased to announce a collaboration with Hugging Face, who have created 16 brand new custom nodes for Visual Blocks, bringing the power of Transformers.js and the wider Hugging Face ecosystem to the Visual Blocks framework, which you can now all use too. Eight of these new nodes run entirely client side via Web AI. Let's walk through the Hugging Face collection to see the superpowers you can get out of the box that can help bring your ideas to life.

First up is image segmentation. As you can see, you can pass the model an image and then click a part of the rendered image to reveal just the pixels that belong to the object you clicked on. Previously, Visual Blocks shipped with a person segmentation model, but here Hugging Face have extended this ability further: you can click multiple areas on the image to combine object segmentations and view the results in real time. So depending on what you want to segment, you can choose the most suitable model; for example, for portraits of a person the face parsing model may be a good fit, but for clothing the SegFormer B2 Clothes model may fare better. Try them all out today with the link shown.

Next, translation, a brand new node we did not have before. Here you can take any piece of text, pick a language of your choice from the node's drop-down box, and have your input text converted to the desired language. There are five variations of this model to choose from depending on your requirements, with the smallest being 78 megabytes. Now, you can imagine using this with other nodes to bring powerful ideas to life. Imagine you also have a node that can extract text from images; in that case, you can feed the text found in the image into this translation node to convert what you see around you in the real world into something you can understand when you're on holiday or abroad, just like Google Lens does, but in the web browser. There's a lot of potential to get really creative here, especially when combining with other Visual Blocks nodes.

Next up is token classification. What's that? Well, given some sentence, it can extract words that are in some way meaningful, such as locations, companies, or names of people found in the sentence, as shown on this slide. Having the ability to extract useful information from a long sentence could help you perform a more powerful search or understand your users' intent in greater detail. Again, you can choose from several models depending on your needs.

Moving on, you also have the hello world of machine learning: image classification and object detection. You can select from four new classification models and two new object detection models, including ResNet and YOLO variants. It should be noted that many of these models were trained on the ImageNet-1K dataset, which did not contain people in the training data, so while these models may not perform well on images of people, they're pretty good at finding animals and other objects, as shown in our example images.

Switching back to text models, we also now have a new text classification node. This allows you to classify text based on sentiment or toxicity, for example.
Right now you can choose from the three models provided, depending on your needs, with the smallest being 67 megabytes in size.

Next up, background removal. This model loves to remove the background from an image. Some of you may be wondering how this compares to our existing body segmentation model. Well, the cool thing about this one is that it doesn't just focus on people, so as you can see here, when I remove the background with an animal in the foreground, it still works just fine. Pretty neat. So give it a try today using the link shown on the slide.

Finally, we have depth estimation. For any given image, the model will try to estimate how far away each pixel is from the viewpoint. For subtle movements like you see here, this can help give the illusion of a 3D image using any regular 2D image. You can adjust the displacement amount using the slider in the viewer node to get an effect that works well for your specific image.

In addition to the client-side models on the previous slides, Hugging Face also support several task-based nodes that execute a model of your choice via a server-side call using their own APIs. This means thousands of models that fall under one of these supported task types can now all be used within Visual Blocks too. So what are the supported task types? You can choose from fill-mask, image classification, summarization, text classification, text generation, text-to-image, or even token classification. As this talk focuses on client-side models, we encourage you to try those out in your own time, as they could complement Web AI models in a hybrid manner; with time, I'm sure we'll see client-side variants being produced too, as devices continue to get more powerful. So head on over to the project page at the link shown, go.gf-vblocks, to learn how you can use the new Visual Blocks nodes from Hugging Face today.
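The client-side Hugging Face nodes above are powered by Transformers.js, and you can call the same library directly outside Visual Blocks. Here is a minimal sketch, assuming the @xenova/transformers package; the default model for the task is downloaded and cached on first use, and other tasks such as translation or depth estimation follow the same pattern.

```js
// A minimal sketch, assuming the @xenova/transformers package (Transformers.js).
// The default model for the task is fetched and cached on first use.
import { pipeline } from '@xenova/transformers';

const classify = await pipeline('text-classification');  // sentiment by default
const result = await classify('Web AI makes this so much easier!');
console.log(result);  // e.g. [{ label: 'POSITIVE', score: ... }]
```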
So how did Hugging Face make those custom nodes work seamlessly with their own custom code and APIs? Well, today we are pleased to announce custom nodes for Visual Blocks. Since launching Visual Blocks in early 2023, we've spoken to many of you throughout the year, and time and time again we saw requests from people wanting to run custom logic. On that note, you wanted to be able to make custom nodes that could work with all our existing offerings, so you didn't have to start from a blank canvas, especially for common reusable things such as accessing sensors like the webcam, or common output visualizations for vision or text models. So, hello custom nodes. Built-in nodes may not be a perfect match for all use cases, but this is where custom nodes can shine. Even better, they're just regular JavaScript web components, specifically the custom elements implementation, so it's really easy to make new nodes using your favorite frameworks, or even no frameworks at all, as web components are part of the web standards in modern browsers.

At this point you may be wondering what a custom node can do. The short answer is that if you can write it in JavaScript, you can turn it into one. Maybe you have some custom client-side logic that can be turned into a custom node, or maybe you want to call a new third-party web API on some remote server; that can be a custom node too. And with that, the nodes you create can work with all our existing ones, and even other people's creations, assuming the input and output types match, allowing you to innovate faster by reusing the work of others. So come along and join our workshop this year to learn how to make your very own custom nodes from a blank canvas. For those of you here live, search for the Visual Blocks workshop going on later today, and for those tuning in online, you can search for the recording or head to the codelab link shown to go at your own pace. We look forward to seeing what you all create; if you make a custom node and want a chance to be featured at our future talks, please tag a demo of it in action using the Visual Blocks hashtag on social.
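Since custom nodes are ordinary custom elements, the shape of one will be familiar to any web developer. Here is an illustrative sketch of just the web component part; the actual Visual Blocks node registration API is covered in the workshop and codelab, and the property and event names below are hypothetical.

```js
// An illustrative custom element only; the Visual Blocks node registration API
// itself is not shown, and the property and event names here are hypothetical.
class ShoutTextNode extends HTMLElement {
  // Upstream nodes would feed their output into this input property.
  set inputText(value) {
    const result = String(value).toUpperCase();  // the node's custom logic
    this.textContent = result;                   // simple built-in visualization
    // Emit the result so downstream nodes (or the host app) can consume it.
    this.dispatchEvent(new CustomEvent('outputs', { detail: { result } }));
  }
}
customElements.define('shout-text-node', ShoutTextNode);
```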
Now then, if you can create custom nodes that have well-defined inputs and outputs, what could the future of Visual Blocks look like as AI itself progresses? Today we would like to share with you a research project called InstructPipe, a collaboration between many research folk here at Google, to give you a taste of that future. Today, with Visual Blocks, to go from idea to working prototype you drag out and connect nodes to solve some task you have in mind, and this is much faster than coding each block yourself manually, which is great. But what if we could build a copilot for visual programming to go one level higher still? Imagine you could type a sentence describing what you wanted, and the Visual Blocks graph would be made for you automatically, powered by the latest in generative AI. Let's take a look at our research project in action. Here you can see the user simply types a prompt into the box and the Visual Blocks pipeline is produced. Notice how it was not perfect the first time, so we modified the prompt to specify using a transparent image instead, and then we got the desired effect. Or what about this trip planner, made with a quick prompt and a few user inputs to set it up? Once the graph was created, most of the pipeline is complete, and the user just enters their use case along with a valid API key at run time to get the results for the city they desire. Using this approach, we were able to achieve around an 81% reduction in user interactions, freeing up the end user to focus on the task they're actually trying to solve instead of connecting wires to blocks. Here you can see one more example of this in action. In this case, we asked to turn the image of a tiger into a cat, but to do that we used the PaLM API to describe the existing image first and then used that description to prompt Google's Imagen image generation model, which produced the output we desired: an image with a cat posed in a similar manner. Really incredible stuff, and this is just scratching the surface of what could be possible in the future as multimodal models continue to get better with time. So if you were excited by what you just saw, learn more about this research via the link shown to read our full paper.

Next up: Model Explorer. If you spend a lot of time building, testing, and deploying ML models, you know how important it is to understand what's happening under the hood: for example, how nodes are related, how they are structured, and how they are performing. Google machine learning engineers face this problem every day, so we built a tool we call Model Explorer to make model debugging more intuitive and easier. Now we are excited to share Model Explorer with the world so that the entire ML community can benefit. Model Explorer supports multiple model formats, including JAX, TensorFlow, TensorFlow Lite, and TensorFlow.js files. Many teams in Google use Model Explorer in their daily work, such as Gemini, Chrome, and YouTube. So let's see it in action. Here I'm demonstrating a generative AI diffusion model that can run in the browser. You can see that even navigating very large models is smooth, uncluttered, and low latency. We can navigate the graph layer by layer; clicking on a layer highlights similar ones, saving you time and clicks, and the properties panel gives you detailed information about each layer and node. We designed Model Explorer with usability in mind, so here are a few examples showing how you could leverage its features to enhance your debugging experience. First of all, we provide a bookmark feature that allows you to easily jump back and forth between different areas of the graph. Another notable feature is a color palette that allows you to annotate nodes with colors of your choice, letting you easily find the similar nodes you care about. Model Explorer allows you to easily traverse, analyze, and debug machine learning models of almost any size and complexity, and it's now free for everyone to use. To get started, just visit the link shown on the slide. OK, now I will hand back to Jason to cover what Chrome has been up to. [Applause]

Thank you, Na. So next up, we've collaborated with Chrome, who have been investing in Web AI as well this year. I highly recommend you check out their I/O talks for all the details, but I want to give them a shout out for some highlights. Already in this talk you've seen Web AI in action where a model is loaded and run within the web page itself, like our Gemma large language model. But what if the model was already there for any site to use, built into the browser? That way you wouldn't need to download your own LLM, and instead it could be used across domains via a standardized JavaScript API. So imagine this: what if you could get the model to do what you need, like summarizing a large chunk of text on a web page or a blog post, or making something technical easy to understand, without having to master machine learning model creation skills?
What superpowers could you all get as developers? Well, the web is so much bigger than our team, so we've been speaking to all of you about AI and its challenges to find out what's on your minds, to help shape the future of AI in the browser. You can check the summary of our findings via the link shown on this slide, and we welcome further feedback if you've got ideas too, including thoughts on Chrome's built-in AI approach to solving the key challenges we've heard you've been facing; use the WebAI hashtag on social to tell us your thoughts. On that note, we're also working on a new website providing guidance specifically for web developers choosing to use AI. With this site we aim to help you understand key AI concepts so you can discover opportunities to use popular models, be more productive than ever, and use generative AI to build delightful user experiences with existing tools, models, and APIs. Bookmark the site shown on this slide, as we'll continue to publish more content there over the year.

All right, going deeper down the stack, Chrome also has updates for WebGPU. This year we can now support 16-bit floating point values for GPUs that support it. But why is that important? Well, as you can see from the screenshot, using 32 bits to store model weights for this very large language model resulted in around 11 tokens per second for decoding data, but by using 16 bits this increased 45% to 16 tokens per second, and it uses half as much memory to store those weights.
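In WebGPU terms, this maps onto the optional 'shader-f16' feature. Here is a minimal sketch of checking for it and opting in, which you would pair with f16 storage in your WGSL shaders; it assumes a browser and GPU that expose the feature.

```js
// A minimal sketch of opting into 16-bit floats in WebGPU where available.
const adapter = await navigator.gpu.requestAdapter();
const hasF16 = adapter.features.has('shader-f16');

// Request the feature so WGSL shaders can declare `enable f16;` and store
// weights as f16 arrays, halving memory use for those buffers.
const device = await adapter.requestDevice({
  requiredFeatures: hasF16 ? ['shader-f16'] : [],
});

console.log(hasF16 ? 'shader-f16 supported' : 'falling back to f32');
```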
Next, updates for WebAssembly. I know a number of folk in the Web AI community have had issues creating web apps when you try to load models larger than 4 gigabytes in size. With the new Memory64 proposal, there is now exploration into supporting 64-bit memory indexes, which would allow you to load larger AI models than you could before; check the link for more details. Also, support for JSPI is now available to trial, which enables better interoperability between WebAssembly and other JavaScript APIs like WebGPU. This essentially bridges the gap between synchronous applications and asynchronous web APIs, which I know some of you will be happy to hear. You can keep track of this proposal and others via the link shown. Finally in this section, Chrome is enabling its translation and speech recognition APIs to work entirely offline, allowing you to go offline with your own web apps and still have these advanced features powering your user experience. It's great to see increasingly advanced features able to run on device, and I believe this is a trend we'll continue to see throughout the year and into the future, especially as models shrink in size for various tasks. So with that, do check out Chrome's talks on the web track to learn more about what they've been up to.

Finally, I'd like to give a shout out to something I know you all love to deal with: testing your AI models. Doing this on the server side is very well documented and fairly straightforward, but what if you want to test a client-side model in a real browser environment to see if it performs well using technologies like WebGPU and WebAssembly? Well, I'm pleased to announce I made a solution you can all try today that allows you to do just that, for any web application that needs WebGL or WebGPU support. Here you can see the difference in performance between the CPU on the left-hand side, which is what you get by default when running headless Chrome, versus the GPU on the right-hand side, when Chrome correctly uses the server-side GPU. As you can see, this can seriously increase your testing speed and allows you to verify that even larger generative AI models work correctly in these browser environments before you actually make them public for the world to use. So if you need to test web models in a more scalable and reproducible way, head on over to our blog post write-up to learn more, or grab the code yourself from GitHub to help you on your testing journey.

And with that, that's a wrap for this year. We really do look forward to seeing what you create with Web AI. 2024 is shaping up to be a great year for progress in the field, from generative AI running locally to performance improvements across the board. And remember, do tag us in anything you create, and if you've got any suggestions for any of the technologies you've seen in today's talk, feel free to connect with us on social media; just search our names. With that, thank you, and see you next time. [Applause]
Info
Channel: Google for Developers
Views: 18,965
Keywords: Google, developers, pr_pr: Google I/O;, ct:Event - Technical Session;, ct:Stack - AI;, ct:Stack - Web;
Id: PJm8WNajZtw
Length: 33min 30sec (2010 seconds)
Published: Thu May 16 2024