How to build Multimodal Retrieval-Augmented Generation (RAG) with Gemini

Captions
[MUSIC PLAYING]

JEFF NELSON: Hey, welcome to today's live session on multimodal retrieval augmented generation. My name is Jeff. And I'm a developer advocate at Google Cloud.

SHILPA KANCHARLA: And I'm Shilpa. I'm an engineer working on generative AI at Google. Thank you for coming to our session. We're so excited that you're here.

JEFF NELSON: So imagine you turn on your car, and all of a sudden, the dashboard lights up with all sorts of error lights. There's a big red triangle. There's something that looks like an engine that says check on it. Fear wells up inside of you. Should you call a tow truck right away? What could be wrong? Is it safe to drive? The owner's manual is hundreds of pages long. And so instead, you just snap a quick photo like this, and you upload it to your trusty large language model. Now, you upload the image and you ask, what's wrong with my car? You wonder, is this a minor glitch? Or is this some sort of major malfunction? Can you risk a quick trip to the store? Or do you need to call the shop right away? And upon answering, the LLM spouts out a generic list of troubleshooting tips, or even worse, a confidently incorrect answer, leaving you even more confused. LLMs can lack the vital connection to your specific issue. Your car's data, its unique history, its specialized terminology, maybe its past repair history, even your typical driving conditions are all typically absent from the vast yet impersonal data that an LLM was pretrained on. And this creates a knowledge gap, making it difficult for you to receive the answers you want in truly specialized ways. But imagine if AI could reach beyond the data that it was trained on and instead retrieve context on demand from your private or first-party sources, and then augment its knowledge base to synthesize a tailored response to address your exact issue. And that's what we're going to talk about today with Retrieval Augmented Generation, or RAG. Just note that you don't have to be a RAG expert today. You don't need to be an AI expert. We're going to try to break this down and make this as accessible as possible. So on the agenda, we'll address what retrieval augmented generation is and how it can help address some of the hurdles that you might face with large language models. We'll begin by giving an overview of RAG and dig into some of the architectures, or the main components, that make it tick. Then we'll show some of the common RAG architectures that use multimodal data, like unstructured text or images, and discuss how we can apply RAG to this type of data. Then we'll jump into an in-depth live demo to show how everything fits together. That is, how we went from a large language model that gave kind of questionable responses to something that is grounded in reality with your first-party data. In doing so, we'll be using the Gemini API from Google AI for Developers as well as Vertex AI from Google Cloud Platform. Lastly, we'll end with a summary and some next steps. So retrieval augmented generation is able to take your powerful LLMs and inject them with real-world knowledge about your own business. It can pinpoint subtle code changes, unearth past errors lurking in your repositories, or even resurrect those late-night chat threads between developers where they miraculously solved some sort of problem that you're facing right now. In essence, RAG is able to take your LLM and ground it in the reality of your own code base, your documentation, and your team's collective wisdom.
It ensures that your AI assistant is not operating in a vacuum just on its pretrained knowledge, but instead working with the same facts, the same history, the same resources that your most seasoned employees have access to. And note that RAG also differs from fine-tuning or training new models. Instead, it augments your existing large language models, enables rapid adaptation, and gives you highly focused responses, meaning that the AI can remain aligned with your business as it continues to grow and change. And when we break it down, RAG has three main components. We'll walk through each of these separately over the next few minutes. First, we'll explain vector embeddings. And these are simply numerical representations of your data that capture semantic meaning. Next, we'll explain how vector search works for retrieval and also introduce the basics of vector databases and how they make vector search a little bit quicker. Lastly, we'll augment a model like Gemini with several inputs and then generate a cohesive answer for our users. So I'm going to pass everything over to Shilpa next. And she'll talk more in depth on each of these topics. Shilpa.

SHILPA KANCHARLA: Thanks, Jeff. You gave us a really great overview of the different components of RAG. Right now, I'm going to discuss the first component, embeddings. Think of your code, documentation, and error logs as different languages. A human developer can usually piece these different components together. But machines have traditionally struggled to do so. This is where embeddings come in. They act like a universal translator, so that machines can understand these components. Traditionally, we generated embeddings on textual data. That is, we took text, we fed it into an embedding model, and out came a vector embedding, which is essentially a list of floating point values. This allowed us to find text that was similar to other text without needing to extract keywords and search upon them. These embeddings don't just represent words; rather, they represent the meaning behind those words. Let's take an example. OK, everyone visualize the word "bug." Now that you have a picture in your head, what if I said something like, I found a bug in my garden? Obviously, in this scenario, I'd be talking about an insect. But what if I said something like, I found a bug in my code, or, this person is really bugging me? The meaning of bug has completely changed. And this is what embeddings can help with. They're a way of turning unstructured data into numerical representations that machines can understand. The numbers in these embeddings capture relationships between words based on the context. So words that often appear together will have similar embedding values. And we're no longer confined to just text. We can generate embeddings on multimodal data. So this includes text, images, audio, video, and even code snippets. All of these can be represented as vectors in the same semantic space and same dimensionality. Just as with text, all we need to do is submit our input to a pretrained embedding model or multimodal embedding model, and numerical vectors are returned. One concrete example: say you're generating product recommendations. You not only want to take into account the user's purchase history, but you might also use their browsing history to look at images they've seen and find similar aesthetics that they're interested in. And you might be wondering, OK, how do I create these embeddings and where do I store them?
Don't worry, you don't need to become an embeddings specialist overnight. There are powerful tools to help simplify the process. Google offers an array of different embeddings APIs. In this example, we simply choose a pretrained Gemini model, feed in our input, and out comes a list of floating point numbers. While these lists of numbers called embeddings don't mean much to our human eye, they mean a lot to machines and contain a great deal of information. Next, I'd like to move on to vector search, which is our second component of RAG. Our collection of embeddings represents our codebase, internal knowledge, documentation, and more. But we need to find the right embeddings quickly. And this is where the retrieval part of RAG comes in. And this is where vector search is used. Traditional search relies on matching exact words. Vector search is smarter and understands the meaning behind those words, thanks to embeddings. This lets RAG find relevant information even if your search query doesn't match exactly what's in your document. One approach to vector search is to compare your query embedding to every single embedding in that vector database. This means calculating the distance to each of the inputs in your database. While this will return the exact closest match, it probably won't scale well to millions or even billions of inputs. So brute force is really better reserved for prototyping cases or smaller data sets. A popular approach is called approximate nearest neighbor search. In this case, embeddings are indexed to form similar groups or clusters. And when a user performs a vector search, the embedding input is first checked against the closest group. It's then compared against all the embeddings in that group to find the most similar text. This approach is significantly faster and more scalable and has impressive recall. That is, it can still provide you with a great deal of accuracy. The following visualization shows the power of multimodal embeddings. You'll see three axes in this embedding space: color, size, and living. When I plot the embeddings of the image and text, notice how the image of the cat in front of the tree and the corresponding text description occupy a similar position in space. This isn't a coincidence. Powerful embedding models have allowed us to represent the meaning behind text, images, and more. Because of this, we can now search for images using text queries, or search through video using images, and so on, and so forth. The convergence of modalities opens up new applications for search and discovery of your data. Just note that this is a simplified example for explanation purposes. In reality, embedding spaces can have hundreds or even thousands of dimensions, which a 3D space cannot represent. And then to discuss the last component of RAG, I'll hand the stage back over to Jeff.

JEFF NELSON: Thank you, Shilpa. So we'll now discuss the last component of RAG, which is large language models like the Gemini model. So we've talked about embeddings. We've talked about vector search and how we retrieve the right information. But now, let's discuss the A and the G in RAG, how to augment and generate using these large language models. Now, some models like Gemini have been trained on huge amounts of multimodal data and are able to understand content across those different modalities. So maybe you've used Gemini to summarize a video, or give you a description of an image, or even help you code. That's all multimodal.
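To make the first two components concrete before the discussion turns to augmentation and generation, here is a minimal sketch that generates embeddings with the Gemini API and runs a brute-force cosine-similarity search over them. It assumes the google-generativeai Python SDK, an API key in a GEMINI_API_KEY environment variable, and the embedding-001 model; the example strings are invented for illustration.

```python
# Minimal sketch: embed a few strings with the Gemini API, then do a
# brute-force nearest-neighbor search with cosine similarity.
import os
import numpy as np
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

documents = [
    "I found a bug in my garden crawling on a leaf.",
    "I found a bug in my code that crashes the parser.",
    "The dashboard shows a low tire pressure warning light.",
]

# Embed the documents; each result is a list of floating point values.
doc_vectors = np.array([
    genai.embed_content(
        model="models/embedding-001",
        content=doc,
        task_type="retrieval_document",
    )["embedding"]
    for doc in documents
])

# Embed the query with the matching "retrieval_query" task type.
query = "What does the warning light on my dashboard mean?"
query_vector = np.array(
    genai.embed_content(
        model="models/embedding-001",
        content=query,
        task_type="retrieval_query",
    )["embedding"]
)

# Brute force: compare the query embedding against every stored embedding.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
best = int(np.argmax(scores))
print(f"Best match ({scores[best]:.3f}): {documents[best]}")
```

At scale, that brute-force comparison is exactly the step an approximate nearest neighbor index in a vector database replaces.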
And after retrieving the relevant chunks of data from our vector search, we want to feed them into our large language model to augment it. The large language model takes the retrieved knowledge, feeds it into a larger prompt, and then generates a human-readable response. And the outputs are not just copy/paste versions of the retrieved info. Instead, the LLM synthesizes the specific facts from the knowledge base and combines them with its own pretrained knowledge, meaning you get accurate and up-to-date information that's presented in a way that's easy to understand. So let's put all of this together. And let's return back to that first example that we gave, where we asked a large language model about the picture of my car. There, we asked, what does the light on my car mean? An LLM generated a response for us. And then it gave us an answer. Naive LLMs, as impressive as they are, have limitations. Their knowledge is frozen in time. And they can't incorporate new information without some sort of retraining. But with RAG, we expand this architecture a little bit. We have these three foundational aspects that we've talked about. So in addition to our large language model, we also incorporate embeddings and vector search. So when a user asks a question, that question is turned into an embedding. The embedding is then matched against all of the items in our vector database. And it looks for other text chunks, images, code snippets, and more within that embedding space. The top items are then retrieved. And we can actually go back and retrieve the original document chunks or images. All of this is fed into the large language model to give us an answer that's a lot more complete. And the responses are more up to date with our current business context, using data that might be your first-party data that the LLM has never seen before. So we've discussed the components of RAG and its general architecture. But this session is also about multimodal RAG. So let's discuss a couple of multimodal-specific architectures next. And there are a couple of emergent patterns for multimodal RAG. In the first, we take all of our multimodal data, images, unstructured text, code, video, and more, and we summarize it all into text using a model like Gemini 1.5 Pro. These summaries are then turned into embeddings and stored in a vector database. And then when a user goes to ask a question, retrieval takes place by comparing the input query to the text embeddings in our vector database, including summaries of things like your code or your images. And the output then could include not only those summaries, but also the raw images or code itself. This approach is rather straightforward, but it does risk some information loss because we are generating text summaries. In the second approach, we could use multimodal embeddings across the board. That is, we take our multimodal data, like our text, images, audio, and video, and we turn them all into embeddings in the same embedding space. And we store all of these items in a vector database. Now for retrieval, we can input our items in different modalities. We could input text. We could input images. Again, that's turned into an embedding and matched against all of our multimodal embeddings. Then we retrieve the top results. We get the actual raw assets back. And then we use a large language model to summarize everything for us. So this approach generally has higher accuracy. And there's a little bit less information loss.
But it does require that we have pretty flexible multimodal embedding models available to us that can take images and texts of various sizes and lengths. So in the next section, we're going to dive deeper with a live demo. And in the demo, we'll use the first approach that we see here, which utilizes text-based embeddings for multimodal retrieval. So Shilpa will take it away for this demo.

SHILPA KANCHARLA: Thanks, Jeff. You gave us an amazing intro to the Gemini model. Over the next few minutes, we'll now witness how Gemini 1.5 Pro can be used to perform RAG. Today, let's imagine we purchased a new car. It's the 2024 Cymbal Starlight, a modern vehicle, and we want to ask some questions about it. For example, what does this light mean on my dashboard? Is it bad enough that I have to take it to the shop right now? Or how many miles until I need to get an oil change? Or what if I break down on the freeway, how do I ask for help? Now, if we ask an LLM any of these questions, we're not going to get really great responses. For example, I can ask Gemini 1.5 Pro, how many miles can I drive my 2024 Cymbal Starlight until I need to change my oil? And it responds with a bunch of information that isn't really relevant to us. And this is because the LLM wasn't trained on the 2024 Cymbal Starlight manual. Why? Because our team made up this example. And it doesn't actually exist in the corpus of training data found in the wild. And this is a constraint of LLMs. But to solve it, we can use RAG, which augments the prompt sent over to an LLM with data retrieved from an external knowledge base, like this handy owner's manual. Let's see how it works. Before we get too code heavy, it's important to understand the preprocessing components of a RAG pipeline. In this pipeline, we begin with a PDF or a set of documents. We split them into their component parts, so text is split into smaller chunks, and images and tables are extracted. Each of these chunks, images, and tables will then be turned into embeddings. Then these embeddings are stored in a vector database for fast lookup and retrieval. Let's take a look at how to achieve this by moving over to the notebook. We'll be walking through a Colab notebook today. And before the session, we took some time to get connected to a runtime, installed a number of useful libraries, and did some setup like defining a Google Cloud project. So let's skip all of that for now and dive straight into the demo. We'll begin by downloading our source data. As mentioned earlier, this is simply the PDF version of the 2024 Cymbal Starlight manual. Here, we downloaded the PDF document that contains a mix of textual information and visual elements like images and tables. And we can now see it in our Colab environment as well, in the left-hand panel. As we showed in the architecture diagram a minute ago, we first take the PDF and split it into its component parts. In this cell, we extract the images and tables from the PDF. We also split the text into smaller chunks based on section titles and headers found in the source, as well as the content length. This lets us efficiently process and generate embeddings for each element. It also improves accuracy in retrieval, because you can retrieve a specific part of your text rather than, say, an entire page or a group of pages. While this is running, we can take a quick look at the PDF document as well. So here's our PDF. And if we scroll through, we can see some visual elements right there.
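While that extraction cell runs, here is a simplified stand-in for the kind of splitting step just described. The real notebook uses its own extraction utilities; pypdf, LangChain's RecursiveCharacterTextSplitter, the chunk sizes, and the file name below are all assumptions made for illustration.

```python
# Simplified stand-in: pull text and images out of the PDF with pypdf,
# then split the text into smaller, overlapping chunks for embedding.
from pathlib import Path

from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter

reader = PdfReader("cymbal-starlight-2024.pdf")  # hypothetical file name

# Extract the raw text page by page, and save any embedded images locally.
full_text = ""
image_dir = Path("extracted_images")
image_dir.mkdir(exist_ok=True)
for page_number, page in enumerate(reader.pages):
    full_text += page.extract_text() or ""
    for image in page.images:
        (image_dir / f"page{page_number}_{image.name}").write_bytes(image.data)

# Split into chunks so retrieval can return a specific passage
# rather than an entire page.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
text_chunks = splitter.split_text(full_text)
print(f"{len(text_chunks)} text chunks, {len(list(image_dir.iterdir()))} images")
```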
And if we keep scrolling through the PDF, we can see a table as well. And you can see how this PDF might be split up into different chunks based on the section titles or section headers. Now, once this is complete, we can see the images extracted locally in the Colab. Yep, there they are. And we can also see the sample data as well. Moving on, it's important to categorize the non-textual elements we extract into tables and texts. These will be treated differently in our RAG pipeline, and we'll use the Gemini model to transcribe the tables into text and to summarize long text chunks. All you need to get started is an API key in order to get access to the Gemini 1.5 Pro model. We have a key here, and we kept it in Colab Secrets. Next, we define our summarization model as Gemini 1.5 Pro. We define a prompt that instructs the Gemini model to generate concise summaries of the extracted texts and tables, which will be optimized for retrieval later on. In this demo, we only summarize some small tables in the manual. However, you can use this as a reference if you're dealing with larger amounts of text, say with a textbook. With that taken care of, we can turn our attention to the images that reside in the PDF. We've extracted each image in the manual locally. And we now need to generate summaries of each of them. We define a prompt similar to our text summarization prompt just shown a minute ago. But this time, we'll instruct the Gemini model to describe the content of the image, focusing on key elements like tables, graphs, and their meanings. The concise summary is represented as text and will eventually be used for embedding purposes. We'll begin by encoding each image into a base64 string for embedding and transmission. This is a useful step whenever you're transferring images into a text-only system and processing them there. After this, we generate summaries for each of the images using the prompt that we previously defined. And we use the Gemini 1.5 Pro model again to carry this out. From here, we have our raw image summaries. These will be turned into embeddings later on for retrieval. Let's take a look at one of our images and the summary generated for it by Gemini 1.5 Pro. OK, here's the first example. We see this coolant temperature warning. And you can see a red symbol against a black background. And as you can see, the summary describes exactly that, as well as what the red symbol looks like. And let's take a look at our second example as well. So here, we have a low tire pressure symbol. And we can see a yellow symbol against a black background. And that's exactly what's described as well. Yep, and now that we've taken a look at that, I'll turn it back over to Jeff, who will explain the next components, like the vector store.

JEFF NELSON: Awesome. Thank you, Shilpa. So in order to retrieve the embeddings that Shilpa just spoke about, we actually need to store them somewhere. And Google offers an array of different vector databases to do so. Today, we'll be using Vertex AI Vector Search to house our embeddings. This cloud-based offering enables millisecond lookups and uses the same infrastructure that's used for Google Search and YouTube. You can see that we input a couple of different parameters here, one of them being the dimensionality of our embedding model, and a couple of other items. Next, we define an index endpoint.
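A hedged sketch of what creating that index and endpoint can look like with the Vertex AI Python SDK. The project, region, display names, the 768-dimension figure, and the deployed index ID are assumptions chosen for illustration, not the exact values from the demo.

```python
# Hedged sketch: create a Vertex AI Vector Search index and index endpoint.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical

# Create a streaming-update index sized to our embedding model's output.
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="cymbal-starlight-manual-index",
    dimensions=768,                       # assumed embedding dimensionality
    approximate_neighbors_count=150,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
    index_update_method="STREAM_UPDATE",
)

# Create a public endpoint and deploy the index to it so we can query it.
endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="cymbal-starlight-manual-endpoint",
    public_endpoint_enabled=True,
)
endpoint.deploy_index(index=index, deployed_index_id="cymbal_manual_deployed")
```

The STREAM_UPDATE setting is what lets embeddings be pushed into the index in real time later in the walkthrough.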
And this is going to allow us to actually use this vector store from within this notebook. And lastly, we deploy the index to that endpoint so that we can use it. This takes a couple of minutes to run, so we've run it beforehand. And now it's time to turn all of the chunks that we saw into embeddings and store them. We'll do this next with a vector store. We're going to define this vector store with LangChain, using the index and endpoint that we created just a minute ago. You see a couple of other items here around our Google Cloud project and, importantly, the embedding model. We'll be using the Gemini API embedding model, embedding-001. And we'll stream all of these updates directly into our vector store. So I'll run that. And next, we use a multi-vector retriever. The reason is that our vector store will only contain the embedding and an ID, but we use a document store that's indexed by that same ID, so that we can access the original source documents after searching through our embeddings. In other words, when we do the retrieval part of our RAG pipeline, we don't just want to retrieve a bunch of lists of numbers. We also want to retrieve the original images and the text chunks in order to feed them into Gemini. And with all of that setup taken care of, we can now load our raw documents into the document store. We can additionally generate embeddings for all of our text chunks and image summaries that we created earlier and stream them into our vector store, again, Vertex AI Vector Search. So I'll run that. And we'll see some things start to populate below. Each of these UUIDs here represents an embedding that was just streamed into Vertex AI Vector Search in real time. So the next step is actually accessing all of those in our Q&A pipeline. Just a reminder, going back to this slide: in our Q&A pipeline, when we ask a question, we turn that question into an embedding. And then we search across all of our embeddings in our vector database. We get the matches. And then we join these back to our document store to get the original text chunks, any summaries, and any images. And we feed all of this into our large language model to give us a great answer. So we're going to begin by constructing a RAG chain, which is just a sequence of operations that forms our core application. And it contains a couple of different components. It first starts by finding any relevant information from our document input chunks. This includes both text and images that are identified using the input query. Next, we use an image prompt function that instructs Gemini to act like a helpful automotive expert when it provides its responses. And this preprompt helps the Gemini model use verbiage around the topic at hand, auto repair, rather than acting just as a generalist. This returns better responses for us as well. And lastly, we also have the ChatGoogleGenerativeAI class, which allows us to access the Gemini 1.5 Pro model and send through not only our user's query, but also the retrieved context from our vector database. This generates a coherent response and gives it back to the user in a readable format. With all of that set up, I can now define a multimodal search function, which is kind of a wrapper around everything that we've just done. The multimodal search function takes a text query as an input. It then runs a vector search to identify the texts and image summaries that are the best match.
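Pulling those setup steps together, here is a hedged sketch of the vector store and multi-vector retriever wiring, continuing from the earlier sketches (it reuses the index, endpoint, and text_chunks defined there). The class names come from the langchain-google-vertexai and langchain-google-genai packages; the project, staging bucket, and the choice to store raw chunks in the docstore are assumptions.

```python
# Hedged sketch: a Vector Search-backed vector store for the embeddings,
# plus an in-memory document store holding the originals, joined by a
# shared ID so retrieval can hand the raw content back to Gemini.
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_vertexai import VectorSearchVectorStore

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

vector_store = VectorSearchVectorStore.from_components(
    project_id="my-project",              # hypothetical
    region="us-central1",
    gcs_bucket_name="my-staging-bucket",   # hypothetical
    index_id=index.name,                   # from the earlier index sketch
    endpoint_id=endpoint.name,
    embedding=embeddings,
    stream_update=True,
)

# The vector store only holds embeddings plus an ID; the docstore keeps the
# original chunks (and, in the demo, base64 images), keyed by the same ID.
id_key = "doc_id"
docstore = InMemoryStore()
retriever = MultiVectorRetriever(
    vectorstore=vector_store, docstore=docstore, id_key=id_key
)

doc_ids = [str(uuid.uuid4()) for _ in text_chunks]
summary_docs = [
    Document(page_content=chunk, metadata={id_key: doc_ids[i]})
    for i, chunk in enumerate(text_chunks)
]
retriever.vectorstore.add_documents(summary_docs)      # streamed to Vector Search
retriever.docstore.mset(list(zip(doc_ids, text_chunks)))
```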
Continuing with the multimodal search function: the chain then pulls the raw components, like images or raw text chunks, and summarizes them using the Gemini 1.5 Pro model. It then prints or displays all of these texts and images, and we see a lot of print statements here, so that we can be assured that we are actually pulling the right things in. The result is a summarization that's grounded in our ground truth, that is, in our input data. OK, so I'm going to go back to that question that we asked earlier. I'm going to ask, how many miles can I drive until I need to change the oil on my 2024 Cymbal Starlight? And when I run this, again, the text query is being run through. And we get four different chunks back. And these chunks are taken directly from our input PDF. Those are then all summarized by Gemini 1.5 Pro. And we get an answer. We should change our oil every 5,000 miles or every six months, whichever comes first. So we're now able to get an answer that we couldn't get before, because we're grounding our results in our first-party data. Let's ask another question. And I'm going to use an image this time. And this is kind of apt timing, because I have an image from my own car. Let's pretend that I drive the 2024 Cymbal Starlight, and I have my own image that I took yesterday while driving over here. Let's take a quick look at it. I uploaded it from my desktop. And let's take a look. We see that I am driving in front of the Google building, but I also have a light on my dashboard that just showed up. So I'm going to ask Gemini to act as an assistant that can describe warning lights on automobiles and give me a description of that warning light. So we get an answer. There's an amber-colored icon. It's shaped like a horseshoe with an exclamation mark in the center. And I think that's a pretty good description of the error light that we're seeing here on my dashboard. Then I'm going to add a little bit more to this text. So I can say, what does the following light mean? And I'm going to pass that through the multimodal search function that we created earlier. Again, we retrieve a couple of text chunks. And it looks like we retrieved some matching images from our source documents as well. Now, the top one doesn't look like it's totally relevant, but the bottom one definitely looks like what we intended to retrieve. It looks just like what was on my dashboard. So I can validate that it looks pretty much like what we'd expect. And then I also get Gemini 1.5 Pro to summarize an answer for me. It's saying that the icon on my dashboard is a low tire pressure warning, a TPMS warning light, which means that one of my tires is under-inflated. I should check the tire pressures and probably inflate those tires to get rid of that error message. So this is cool. I can input an image. I can input some text. And I can retrieve both text summaries and the raw images themselves. So I can validate that I'm getting the right answers back. Now, I say, OK, I know my tire pressure is low. But what should the tire pressure be on this automobile? So I'll just ask that follow-up question next. Again, we retrieve a number of different text chunks. And we receive an answer that's summarized from our grounding data that says I should pump up my front tires to 35 PSI and my rear tires to 38 PSI. So I should probably do that over the next couple of days so I can clear that error message from my dashboard.
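To recap the query path just demonstrated, here is a simplified sketch of a multimodal_search-style helper. The actual notebook wires this up as a LangChain chain; this version is an illustrative hand-rolled equivalent that reuses the retriever from the earlier sketch, and the prompt wording, the base64 convention for stored images, and the example question are assumptions.

```python
# Simplified sketch: retrieve the best-matching chunks, then hand the user's
# question, the retrieved text, and any retrieved images to Gemini 1.5 Pro.
import base64

import google.generativeai as genai


def looks_like_base64_image(text: str) -> bool:
    """Crude check for docstore entries that hold a base64-encoded image."""
    return text.startswith("/9j/") or text.startswith("iVBOR")  # JPEG / PNG


def multimodal_search(question: str) -> str:
    # The retriever matches embeddings, then joins back to the originals.
    docs = retriever.invoke(question)
    texts, image_parts = [], []
    for doc in docs:
        content = doc if isinstance(doc, str) else doc.page_content
        if looks_like_base64_image(content):
            image_parts.append(
                {"mime_type": "image/jpeg", "data": base64.b64decode(content)}
            )
        else:
            texts.append(content)

    prompt = (
        "You are a helpful automotive expert. Using only the context below, "
        f"answer the question.\n\nContext:\n{chr(10).join(texts)}\n\n"
        f"Question: {question}"
    )
    model = genai.GenerativeModel("gemini-1.5-pro")
    response = model.generate_content([prompt, *image_parts])
    return response.text


print(multimodal_search(
    "How many miles can I drive the 2024 Cymbal Starlight before an oil change?"
))
```

The flow is the same either way: embed the question, retrieve the closest matches, join them back to the original chunks and images, and let Gemini generate the grounded answer.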
Back in the demo, we also get a little bit of additional information from our owner's manual that tells me where I can find that tire pressure info if I were so inclined. But it's pretty convenient to just be able to ask these questions from my RAG pipeline instead of having to scroll through hundreds of pages of an owner's manual to get that answer. RAG was able to generate this answer using our first-party data and even show an image from our manual, so I know it's the right warning light. And that's the end of the demo here. So let's move back over to the slides. Just to summarize what we spoke about today, RAG is a powerful architecture that combines information retrieval techniques with generative large language models and enhances the quality and relevance of our generated text. The main idea is to provide the generative LLM with access to external knowledge sources, like your first-party data or your enterprise data, and allow it to retrieve the relevant information and incorporate that into the generation process. And although today's examples all related to cars, we're not actually confined to any one single industry. So we could talk about maybe the technology industry, where multimodal RAG could help accelerate complex legacy codebase migrations. We could analyze code, architecture diagrams, any chats between developers, any documentation. And we could suggest migration strategies. We could identify similar modern code samples and really speed up that time to value for the migration. In the retail sector, a customer might ask a question like, where can I get a shirt like this? Or where can I get pants in a different color? They might even upload a pair of pants or some shoes and say, where can I get something like this? Do you have this in stock? Again, RAG can analyze the text of the question. It can analyze the image. It can analyze the video, check against your product catalog, and craft a direct answer and summary back to the user. Not only that, but it could include product links and maybe other recommended products that the user might be interested in. In the media and entertainment sector, a user might be looking for a movie on a Friday night. And they might want to input a combination of text, images, or audio to convey the type of mood of the movie that they're looking for. So I might say I want some rainy day vibes because it's raining. I might upload a picture of a cozy fireplace. And I might also say, hey, I love this mellow jazz track. Multimodal RAG could take all of these inputs, synthesize them, extract the emotional tone, the visual aesthetics, and the audio cues, and provide me as the user with the right movie for that movie night. So this slide contains a list of how to get started. The Gemini API landing page is the first link. And it's the quickest way to get started using Gemini. All you need is an account and an API key. We also have Vertex AI, which is hosted by Google Cloud and now provides over 150 foundation models, including Gemini 1.5 Pro, allowing you to run your workloads at enterprise scale. And we also include a couple of links to code samples. First, to the Gemini API repository, which includes references to a number of different notebooks, and also to our Google Cloud Vertex AI GitHub repo, which contains dozens of notebooks, from how to get started to things like RAG, agent building, and more. So thank you so much for joining us today to explore multimodal RAG.
We hope you found this session valuable and that it inspires you to build something of your own. [MUSIC PLAYING]
Info
Channel: Google for Developers
Views: 33,008
Keywords: Google, developers, Google I/O, Technical Session, AI, Cloud
Id: LF7I6raAIL4
Length: 34min 22sec (2062 seconds)
Published: Thu May 16 2024