Generative AI powered use cases for data engineers

Captions
[ENERGETIC MUSIC] JASON DAVENPORT: It's great to meet everyone here. My name is Jason Davenport. I am a developer advocate. So today, we're going to cover a few things. First off, we will talk about generative AI, all the cool things that you can do with it, but more importantly, how do we actually create those rubrics of things that we want to do and get that value from AI and all the things with it? Next, we'll talk about some of the different tools that we've recently announced here at Google Cloud for doing different things in generative AI. And then, last, we'll actually pull together, and we'll do a couple short demos using some of our AI products here actually in the Console and see some of the cool things that you can hopefully go and unlock after we're done with the session today. And I will also leave plenty of time for Q&A here for folks who are interested in that, as well. All right, AI, it's the thing that's the fad that will disappear in six months. No, just kidding. We could only hope, right? And we could just go back to ML and stats and all the fun things that are the foundations. Now, if you think about where we've come, especially in 2023 with the advent of all of these large language models, the thing here that we're always trying to think about as we introduce new technologies and changes is, how do we get three things really accomplished? First, how do we think about how we can do more with the same or less? Everyone here has hundreds of things that we're trying to do. And we're always trying to unlock and figure out ways to do things more efficiently, do it better, do it with higher quality. Next, how do we actually innovate? So in this case, how do we do things that we haven't done before, whether that's novel to us, novel to our organizations, or maybe novel to society and all those practices? And third, and I think this is one of the more interesting things with all this new technology, is how do we actually think about differentiation? 
So doing things like, what are different industries to go into? What are different things from my own skill set that I wasn't able to do before? How can I do those things and do those well and effectively in those processes? However, we have this new pillar, which is incredibly important with LLMs and AI in general, which is, how do we do this securely and at scale, because what we don't want to see is someone goes to Bard and plugs in a question, and they understand that Jason Davenport lives in Denver, Colorado, immediately. That would obviously be very bad. None of us here, I think, are alone in this journey. We did a survey, and 91% of organizations are actually trying to increase all these investments. But really, if we think about tensions and things that are coming against this, only 20% of organizations are actually realizing use cases. And really, then, the question is, as analysts, as scientists, as engineers, how do we close this value gap? Because if we have 91% of our companies trying to do these things but we're only deploying 20%, well, that's obviously a very low realized rate of getting value from all of these different models. Interestingly, five years ago, I think this was almost the same for base ML. And so the question is really, how do we unlock some of those trends and really get that value moving forward? I love these things because when we start talking about trillions of dollars, it becomes immediately immaterial to all of us in the room. I know, trillions, right? It's crazy. I think here, as we see more and more of these statistics about new use cases and productivity and all that, the question that comes to my mind is, how are we using generative AI for the powers of expansion and differentiation? And we'll show some examples of how we can do things using productivity enhancements. But really, here, the goal is to grow. And then how do we make sure that we're doing that with the right tools in that? 
Some of these value things, quite honestly, I think there was a slide like this probably of data lakes back in 2008 or 2009. And the thing here is the way that we will unlock these things and do more cool things is not by doing the same things that we're doing today. It's unlocking those new use cases that we never had the opportunity to even consider in the past. Sorry, I messed up the clicker. All right, so why do large language models actually make these things more effective? First, if we think about LLMs as emergent technologies, what this really means is that as an engineer, I can also do things like add marketing copy. That was not something that we could do before the advent of this without either a lot of my time spent on doing something or me having the direct worker knowledge for that. Second, with LLMs, we also really have this contextual understanding of language. And where this becomes really important is if you think of the relationship between cats and dogs, as humans, we can infer the middle part of that, right? Maybe the dog chases the cat. Maybe the cat and the dog live in the same house. Before LLMs, we knew that we had cats, and we had attributes of cats. And we knew that we had dogs and attributes of dogs. But to fill that kind of chasm in the middle and understand the intents of the two entities was very difficult, even in things like BERT models and those things that came prior to this. And so with that, now that we can actually understand that context, we can unlock new experiences for this. And then last, and really coming to that connection piece, is if we think about all the data that's available, LLMs-- if we can direct them effectively-- are really how we can unlock all of that opportunity and understand relationships that may not be immediately apparent to us just because of the volume of data that we have to get through.
I think here, as we think about now we're getting into the activation space and how do we get to jobs and tasks that may be useful for LLM use cases, probably the main thing here I would call out is that LLMs in the information economy are really the things that we're looking to, to say, OK, how do we take the information economy to that next iteration or that next generation? And as I think we've all seen, change is never a linear function. Change is always stepwise. What are the stepwise things that, for example, myself as a software engineer, I will be able to unlock moving forward? Or what is my customer experience going to be that really makes it more impactful? So if we think about that, we've started going from roles. Now let's talk about how we're going to actually break apart these things into maybe jobs that users might perform and think about, from those jobs or tasks, how we can then actually get to do some sort of scorecard for prioritization and actually using LLMs for this. And this is using a popular framework called "jobs to be done," for those who are interested. It's a really great way of just breaking down the activities that we do and the value or outcomes that we're trying to drive with that. But thinking through here, if I'm a marketing user-- I'm maybe a data engineer supporting marketing-- what are the things that might benefit from using an LLM for building something like marketing and social copy? As a developer or engineer, one of the things that we always see is testing is great. Testing is always the last thing that I do. And as a result of it, testing is never actually put into my code. Documentation is always something where it's a nice-to-have. And I document it when it's painful, but everything else I don't. So what are those things where we inherently may assign value to that, but we have not been able to do to date at scale, that we may actually want to think about as those first things to go through?
And then, last but not least, coming to a knowledge worker, what are maybe things that I can't do with my data that an LLM may be able to help me do, things like anomaly detection, where I can-- I understand what anomalies are based on my current understanding, but what if I just gave this to something else, or this data, and said, hey, what are the anomalies that you see? Having context in these situations is very important. And once I can actually use different contexts to my advantage, how do I build in that full corpus of information and really take action on that? So let's walk through some ways that we can start thinking about prioritization of all these fun things. And then we'll get into talking about some of those tools and value pieces here in a little bit. First, speaking to the last slide, what are the things that you as an individual or you as an organization want to prioritize as your use cases moving forward? And here, I think there's really kind of two ways that we've seen this go so far, right? One is that we pick something that's crazy, something never been able to be done before. Let's try to actually do that and use this technology as a bridge for that. The other is, let's pick something that's small but is something that we can actually measure and test and see the potential benefits of. Both of them are completely valid. It's also good to understand what are-- what's the level of effort I'm going to apply, if it's a moonshot or if it's something where I'm looking for incremental gains on. Obviously, thinking about the platform is very important here. So what are the technology pieces? What are the business processes and all those things that go with it? And then if we start to think about moving into the middle, this is really where it's, OK, what are those models or the actual large-language things that I'm going to use in order to get this work done? So first, obviously, is, OK, well, what are the models that exist? 
Based on my problem, is this a code problem? Is this a language problem? Is this unstructured generation? What are those things with it? Next is, how do these models actually perform at the core tasks? And then how are these models ultimately integrated? So take a very, very, very large model with trillions of parameters in that. If my use case is sub 200-millisecond real-time Q&A, I probably have a disconnect in terms of what my requirements for value are and the model that I'm trying to actually use for generation. And then really here, where we start to get into, is, what are the tailoring things that we need to do for this? And first question, do I actually need something that's custom? Or can I use things like prompt inputs or context setting to actually make the LLM effective without all of the additional costs incurred with something like custom training? And then, ultimately-- and I think this is where we've really seen a lot of momentum in the past six months in terms of trying to build a common understanding for this technology-- is, how do I ultimately keep my data safe within my organization, and then keep my customers' data safe within my organization? And so here-- and I can't stress this enough-- this is where we're really trying to think about, how do we bring compute, or the AI compute, to the data versus taking the data to the AI compute, because ultimately, the data is the high-value thing. And what we want to make sure of is that we can keep it safe, keep it in residency, keep it in all those things that matter for it. And when we do all these things, what we ultimately can do is start to close that gap between the data pieces and the business questions that I have and then the actual value propositions or the opportunities that I see in order to close that. And here, it's really built on, again, the premise of, let's bring AI to data. Let's unify those pieces together so it's a great experience.
And then let's bring the ecosystem partners and pieces into that so it's very easy and straightforward to do for our work. All right, fun things, right? A lot of things that we have to think about in terms of prioritization. But let's talk about things as engineers that ultimately we find super cool, which are tools we get to use in order to do cool things with LLMs. So why do people first come to Google, right? Folks here, I assume, many people are working with tools like BigQuery. Really, we are trying to make a data and AI-first version of Cloud in order to do all the cool things that we believe the next 10 to 20 years of business problems will require to be solved. Next, we really try to make sure that we're bringing that compute to your data regardless of where it is, so using things like BigQuery Omni to run in place on your data in another cloud or doing something like Spark to be able to run on an edge node where your data actually resides. And I think here, years of research have gone into this, both for our work and then to benefit the larger community in LLM pieces, ultimately yielding a highly secure data platform for doing these things. And with that, if we think about doing this for any user, it's really a couple core products that we try to think of in terms of that initial value story. And the center here is really using BigQuery and Vertex AI together in order to build use cases and build amazing things across our different pieces and stacks. So with that, let's start to get into some of the different attributes of the data cloud here and-- or, sorry, the AI cloud, not the data cloud. The data cloud is the other thing that I do on the side.
Here, for the AI cloud, what we're working with or working toward with Vertex AI is really having that end-to-end experience, whether that's your initial kind of hypothesis building or your actual production machine-learning operations platform, have a single-stop shop for doing anything that's required for AI in order to make it more useful and more effective for your organization. All of this is built on things that we call foundational models. So folks here have probably heard about PaLM. PaLM is obviously a set of foundational models that we can use for things like check-- sorry, text, chat, code, even now getting into image gen for building new and novel things on top of that. And all of this built obviously on things like TPUs and GPUs, where you're getting the scale of Google but without the $10 million cost of infrastructure and buildout in those pieces of it. And then working up is the experiences that we get as developers, using things like generative AI App Builder so that way we can do things like test-- are we going to get the right images? Are we going to be able to actually build a chatbot in this experience?-- and then ultimately integrating these things into different platforms, so whether it's something we may provide as Google for using something like Duet AI or whether this is actually a vertical solution, where you may take a component of this and implement it in your own platform for customers in that. So I think if we were playing buzzword bingo, I've probably said "security" 5 or 10 times. I should probably say it a couple times more. It's very important to us in all of these things that your data remains your data. And if you think about even the new rules in particular for the EU around customer privacy and digital privacy and all these things, our intent here is that we bring all these data pieces into your perimeter so you can use them, but everything always remains secure. 
And there's things like using service perimeters so that way you can prevent things like exfiltration. And these are all things that we provide just out of the box in order to actually make your data highly secure in whatever region you need in order to make sure that your work can be done. All right, so we first talked about what are the foundational components that we have. The next thing that we've been launching, and you've probably seen a lot of these things in briefings and news, is Duet AI. So Duet AI is essentially our chatbot for you as developers or analysts to be able to do different things and have really powerful AI-driven experiences for that. And ultimately, our job here with this is to help you be more effective and more productive. And with that, what I'm going to do is I'll cut over. We have about a 2-minute video, because there are so many different components of this, and using it is really effective in terms of just conveying what all the things and the personas are that we can actually work with in Duet AI. So I will cut over here to that. [VIDEO PLAYBACK] [UPBEAT MUSIC] [END PLAYBACK] All right, so let's, because we are talking about engineering here, talk a little bit more about one of those areas in Duet, which is obviously one that I spend a good amount of time with. And that's, how do we use Duet AI and BigQuery and all the things that we just talked about? So here, what we're seeing, just a little movie here on the side, is how can we actually use real-time completion in order to go through and start making, essentially in this case, a PySpark document?
And here, what we're using Duet AI for is those inline things where, if I think of my own coding pattern, quite frankly what it usually comes down to is me having two screens open, one with the thing that I'm writing and one with the internet as the search engine, and trying to just make sure that I can get the right pieces in place. And here, this is really where, if we think about the experiences that we can power using tools like Duet AI-- you know, ultimately what we want is, how can we do everything in one window, do it effectively, and do it in a way that makes sense to create that value? And using things like SQL completion and Python completion are really ways that we're trying to make this easier for practitioners to do more. We have had the pleasure of working with some large organizations like L'Oréal. And here, I think that the key takeaway of all these things, as much as we love to talk about tech, is ultimately what we are doing is using tech to help engineers and to help companies do better things with AI, and it's always exciting to have organizations that we can work with on such fun use cases to grow and to build together. So the next thing here is-- and I would encourage everyone to try it once we get out of here-- Generative AI Studio, which we just introduced. And here, using Generative AI Studio, this really just gives us that authoring place to start building interesting LLM experiences and make sure that those prototypes make sense. One of the things I always think about is, if it's going to take me two months to build a thing, well, how can I validate in 15 or 30 minutes that this is something that's worth my time? And I found using this is, well, one, it's kind of fun because I can start making images of really crazy things. But it does help me to make sure that my value cases and things that I'm building are useful in that. And all those foundational models-- so we talked a little bit about PaLM for text, PaLM for chat.
The other things here in the bottom, and what we saw with Duet, is, how do we then bring things like code completion and Codey into your coding environment? So if you think of all the surfaces that we have as interfaces for doing code and other development work, Codey here is really, how do we then make sure that if you're working in our Console, if you're working in Visual Studio Code, if you're working in another IDE, that, again, we are bringing the compute to where you are as the developer? We're bringing it to your IDE, to your data, and ultimately helping you get more done. And what this will yield is really the next generation of both Vertex AI and the BigQuery platform. So obviously, there's a lot of fun things that we could talk about. And we could probably talk for hours about MLOps and how we make things run in production. The other thing is, how do we make it easy to serve all these models at scale? And using things like Model Garden, having better prompt design and tuning support in generative AI-- again, it's really based on the learnings that we've had so rapidly over the past 10 months and sharing that with everyone in order to build great things. So with that, what I'm going to do is I'm going to cut over to my laptop here. And we're going to do a couple quick demonstrations. So the first one will just be one of, how does Duet AI in the Console work? How do we get to see some of the things that we can do, even in terms of the generalized question and answer that we can give it? And then for the other one, we'll walk through being able to actually use an LLM in some of our data analysis-- using it to actually do an unstructured data analysis with some data to make some sense of it. So with that, I'm going to cut over here. All right, and my screen is up here. So just for context of where we are, I've started off here, I am in the home page for Google Cloud.
And as we can see here, I have a little chatbot, which is Duet AI, which I can go ahead and open, so I will do that. And so one of the most common things, if we think about data analysis, is, what are all the things that are available to me in order to actually perform data analysis? And so something we may consider is, OK, well, what are the generalized things that I may want to do? And then what are the products that I may want to do that with? So I could do something like, what should I do for data analysis? What we can start doing is just build some generalized understanding for different things we can do. And here, what we're seeing in the response is, hey, well, we can actually use BigQuery. But there's a few other services that we start talking about. Maybe we talk about real time with Dataflow. Maybe, also, we're an open-processing shop, and we want to do things with Spark. So I can see that I'm starting to get some product association based on the intent of the activity that I'm looking to drive. In this case, maybe I want to learn more about BigQuery. So what is BigQuery? Cool, it's a fully serverless data warehouse. Awesome. What is that? Well, how would I source that? Here, we also try to do a lot of inline hinting. So I think one of the really important things with LLMs, and in this case, it's no different than any other one, is, how do we actually provide citations for information? Because when we can provide citations, that means that the LLM is probably not making things up. I said "probably" because there are times when that happens. Great. So I know what BigQuery is. And in this case, we want to do some unstructured data. So how would I get unstructured data into BigQuery? Whoa, that's a lot. All right, let's break this apart because I just got a lot of great information here.
So essentially what this is telling me is that I can use BigQuery, I can use Data Transfer Service, or I can use the Storage API to help upload a lot of data into BigQuery. And then here are some different ways I can do it and how I can analyze it. I can also obviously come through, and I can start to look through things like, what is the actual unstructured-data documentation? And the goal here isn't necessarily that we're going to give the answer, right? We're not going to say, hey, what's 2 plus 2? Well, it's 4. That's a very deterministic thing. And for the nondeterministic things here, it's, how are we bettering the practitioner's understanding? How are we giving you the right links to find things faster and then ultimately create that value as a part of those statements? All right, so we've started figuring out what are some different things that we want to do with this. Let's say I actually have some SQL now that I want to go execute. And what I'm going to do with that is figure out, hey, what is this actually going to do? And then how do we actually go implement it for this? So the first thing we can pull up, we can see here I have a join table. And here, what we're starting to work with is some sales data from our organization. First thing I want to do is understand, hey, of all this data, what does this query actually represent? How many times here have we seen a query? Someone else wrote it. We pulled it up. We looked at it. You're like, I have no idea what that thing actually represents. Well, let's break that apart. Here, I can click this little button. And what we've actually done is use some context. So we've said to Duet, here's the query. Now take this, and actually explain to me in practitioner's terms what's actually happening. So here, we can see what we're actually doing is we're finding, in this case, because we're using GROUP BYs and ORDER BYs, the top 10 users in this space and all of the different sales attributes that they have with that.
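A query of the shape being explained here might look something like the following. This is a hypothetical sketch, not the query from the demo; the dataset, table, and column names are all illustrative assumptions.

```sql
-- Illustrative sketch: join users to their sales, aggregate per user,
-- and keep the top 10 by total sales (the GROUP BY / ORDER BY pattern
-- Duet is explaining in natural language).
SELECT
  u.user_id,
  u.user_name,
  COUNT(s.sale_id) AS total_orders,
  SUM(s.sale_amount) AS total_sales
FROM mydataset.users AS u
JOIN mydataset.sales AS s
  ON u.user_id = s.user_id
GROUP BY u.user_id, u.user_name
ORDER BY total_sales DESC
LIMIT 10;
```

The point of the demo is that Duet turns a query like this back into a plain-language description, which is often the faster way to understand SQL someone else wrote.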
So we can see here that it writes this out. And while sometimes this may look longer than the query that we're representing, what we're doing here is trying to just make sure that we can build that natural language understanding of what the query is performing. So next, let's say that I have that sales data. And I'm going to go build with it a BigQuery machine learning model, in this case an ARIMA forecasting model. So what we're going to do is we're going to have this model. It's going to execute. And then the thing that I actually want to get from that is, how do I actually get a query that says, OK, I want to iterate on this model? So if I were to run this-- and I ran it ahead of time because it takes a minute or two, and I figured you don't want to have me up here trying to tell jokes or any of that; that would be bad for everyone-- I'm not going to run that. But maybe let's say here I want to understand, what is an ARIMA+ model? Cool. I have some information that's displayed. And so basically it's saying, hey, what's ARIMA? Why is this one a little bit different? It gives me the link to do that. And in this case, what I want to do, because I actually want to get a forecast off of that-- we can see here that just by typing in this intent-- and think of this as like a chat turn-- as soon as I do this and I hit Enter, there's a little wheel that's spun. And then I actually get my query back here that I could execute in order to get that forecast. And here, we can see the results as we pull it up. All right, pretty cool, right? So we've started just taking some of those different engineering workflows. We've done a little bit of data analysis. We've used an LLM to help get these things through. Let's talk about one more use case, then, which is, instead of having that LLM help us drive productivity, how do we actually use an LLM to create some more value statements from different things that I may have?
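The forecasting step shown a moment ago corresponds to BigQuery ML statements roughly like these. `ARIMA_PLUS` and `ML.FORECAST` are the documented BigQuery ML constructs; the dataset, table, and column names here are assumptions for illustration, not the demo's actual schema.

```sql
-- Train a time-series forecasting model on daily sales totals
-- (illustrative names throughout).
CREATE OR REPLACE MODEL mydataset.sales_forecast
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'sale_date',
  time_series_data_col = 'total_sales'
) AS
SELECT
  sale_date,
  SUM(sale_amount) AS total_sales
FROM mydataset.sales
GROUP BY sale_date;

-- Then ask the trained model for a 30-day forecast,
-- with a 90% prediction interval.
SELECT *
FROM ML.FORECAST(MODEL mydataset.sales_forecast,
                 STRUCT(30 AS horizon, 0.9 AS confidence_level));
```

The second statement is the kind of query Duet can generate from a chat turn like "give me a forecast from this model."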
In this case, what we'll do-- and here I just have, again, some simple SQL and a script. Let me go ahead and make it slightly larger. There we go. So what we're going to do in this is we actually have some speech files. And here, working with unstructured data is kind of a hard activity, right? It's a bunch of WAV files. It's a bunch of MP4s, MP3s, things like maybe I get off of a tech call or maybe I get off of a customer call or some other video that may come in. Well, what I need to do is actually figure out how I can analyze these better and at scale. So the first thing that we have here, which you could execute, is just building an external table that actually is an object-linked table in order to pull all of that data into BigQuery for analysis. So when I create this, all that I do then is I have that object table. And now I can use all the cool things in BigQuery in order to actually bring computation to my unstructured data. So what we can do is, if I were to do something like see what's in that speech files table, we can see here, here are some different WAV files that we have. And now, the cool thing that we're going to do is we're going to actually run a statement. We're going to encode this using a Chirp model. And that would be just using this statement here. And then what we're going to do is we're going to actually use an LLM against this data. So we're going to help summarize all of this text into a few things that we can display. So with that, let's say I have it in these things. Here, I have my GENERATE_TEXT statement. So using a single SQL function, we can actually go through and we can pass in what our prompt or our question is. I can give it some text as an input. So this is the context that I'm going to apply. And then we can also do things like tune the model. So how creative do I want the model to be versus drier and fact-based? What are some of the other things that I want available? And then what are output tokens?
So how much output do I want the LLM to produce? And then when we have all those in, what we can do, if I run our last piece here, is we can see that I actually have one of my results. And here I have the content. So what the LLM summarizes is the key things that happened in the conversation. And if we had any categories for things like unsafe content, like violence or safety or something else, we would have those here with scores also displayed in terms of that severity. So here, again, making it easier as you start to interact with these things: understanding, hey, if this is a safe thing, maybe it's OK that it goes out. Anything that's questionable, maybe I have a human in the loop to review these things. All right, so what have we done today? In this part, we summarized how we can use Duet. And we've also started talking through some of the cool use cases where you think about bringing those Vertex AI models to BigQuery that we can use in order to build really interesting and valuable things for our organization. [ENERGETIC MUSIC]
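The unstructured-data demo above can be sketched in SQL roughly as follows. This is an illustrative outline, not the demo's actual code: the connection name, bucket path, and dataset/table/model names are assumptions, `mydataset.llm_model` stands in for a remote model pointing at a Vertex AI text model, and the Chirp speech-to-text step is assumed to have already populated a `transcripts` table.

```sql
-- 1. Build an object table that exposes Cloud Storage audio files
--    to BigQuery (connection and bucket path are illustrative).
CREATE EXTERNAL TABLE mydataset.speech_files
WITH CONNECTION `us.my_connection`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://my-bucket/calls/*.wav']
);

-- 2. After transcribing the audio (the demo uses a Chirp model for
--    this step), summarize each transcript with a single SQL function.
--    `mydataset.transcripts` is an assumed table holding the
--    speech-to-text output.
SELECT *
FROM ML.GENERATE_TEXT(
  MODEL mydataset.llm_model,
  (
    SELECT CONCAT('Summarize this customer call: ', transcript) AS prompt
    FROM mydataset.transcripts
  ),
  STRUCT(
    0.2 AS temperature,       -- lower = drier, more fact-based output
    256 AS max_output_tokens  -- cap on how much the model generates
  )
);
```

The result rows carry the generated summary content alongside safety attributes, which is what makes the human-in-the-loop review described above straightforward to wire up.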
Info
Channel: Google Cloud Tech
Views: 9,315
Id: T0YoctXuSTw
Length: 33min 40sec (2020 seconds)
Published: Tue Dec 19 2023