[MUSIC PLAYING] JEFF NELSON: Hey, welcome
to today's live session on multimodal retrieval
augmented generation. My name is Jeff. And I'm a developer
advocate at Google Cloud. SHILPA KANCHARLA:
And I'm Shilpa. I'm an engineer working on
generative AI at Google. Thank you for coming
to our session. We're so excited
that you're here. JEFF NELSON: So imagine you
turn on your car, and all of a sudden, the
dashboard lights up with all sorts of error lights. There's a big red triangle. There's something that
looks like an engine that says check on it. Fear wells up inside of you. Should you call a
tow truck right away? What could be wrong? Is it safe to drive? The owner's manual is
hundreds of pages long. And so instead, you just
snap a quick photo like this, and you upload it to your
trusty large language model. Now, you upload the
image and you ask, what's wrong with my car? You wonder, is this
a minor glitch? Or is this some sort
of major malfunction? Can you risk a quick
trip to the store? Or do you need to call
the shop right away? And upon answering,
the LLM spouts out a generic list of
troubleshooting tips, or even worse, a confidently
incorrect answer, leaving you even more confused. LLMs can lack the
vital connection to your specific issue. Your car's data,
its unique history, its specialized terminology,
maybe its past repair history, even your typical
driving conditions are all typically absent from
the vast yet impersonal data that an LLM was pretrained on. And this creates
a knowledge gap, making it difficult for you
to receive the answers you want in truly specialized ways. But imagine if AI could
reach beyond the data that it was trained
on and instead retrieve context on demand from
your private or first party sources, and then
augment its knowledge base to synthesize a
tailored response to address your exact issue. And that's what we're going to
talk about today with Retrieval Augmented Generation, or RAG. Just note that you don't have
to be a RAG expert today. You don't need to
be an AI expert. We're going to try to break this
down and make this as accessible as possible. So on the agenda, we'll address
what retrieval augmented generation is, how it can help
address some of the hurdles that you might face with
large language models. And we'll begin by
giving an overview of RAG and dig into some
of the architectures or the main components
that make it tick. Then we'll show some of the
common RAG architectures that use multimodal data, like
unstructured text or images, and discuss how we can apply
RAG to this type of data. Then we'll jump into
an in-depth live demo to show how everything
fits together. That is how we went from a
large language model that gave kind of
questionable responses to something that is grounded
in reality with your first party data. In doing so, we'll
be using the Gemini API from Google
AI for developers as well as Vertex AI from Google Cloud Platform. Lastly, we'll end with a
summary and some next steps. So retrieval
augmented generation, it's able to take
your powerful LLMs and inject them with
real world knowledge about your own business. It can pinpoint
subtle code changes, unearth past errors lurking
in your repositories, or even resurrect
those late night chat threads between developers
where they miraculously solve some sort of problem that
you're facing right now. In essence, RAG is
able to take your LLM and ground it in the
reality of your own code base, your documentation, and
your team's collective wisdom. It ensures that your
AI assistant is not operating in a vacuum just
on its pretrained knowledge, but instead working with the
same facts, the same history, the same resources as your
most seasoned employees have access to. And note that RAG also differs
from fine tuning or training new models. Instead, it augments your
existing large language models, and enables rapid
adaptation, and gives you highly focused
responses, meaning that the AI can remain
aligned with your business as it continues to
grow and change. And when we break it down,
RAG has three main components. We'll walk through each
of these separately over the next few minutes. First, we'll explain
vector embeddings. And these are simply numerical
representations of your data that capture semantic meaning. Next, we'll explain how vector
search works for retrieval and also introduce the
basics of vector databases and how they make vector
search a little bit quicker. Lastly, we'll augment a model
like Gemini with several inputs and then generate a cohesive
answer for our users. So I'm going to pass
everything over to Shilpa next. And she'll talk more in depth
on each of these topics. Shilpa. SHILPA KANCHARLA: Thanks, Jeff. You gave us a really
great overview of the different
components of RAG. Right now, I'm going to discuss
the first component, embeddings. Think of your code,
documentation, and error logs as different languages. A human developer can usually
piece these different components together. But machines have traditionally
struggled to do so. This is where
embeddings come in. They act like the
universal translator. So that machines can
understand these components. Traditionally, we generated
embeddings on textual data. That is, we took text, we fed
it into an embedding model, and out came a vector
embedding, which is essentially a list of
floating point values. This allowed us to find text
that was similar to other text without needing to extract
keywords and search upon them. These embeddings don't
just represent words, but rather they represent the
meaning behind those words. Let's take an example. OK, everyone visualize
the word "bug." Now that you have a
picture in your head, what if I said something like
I found a bug in my garden? Obviously, in this scenario,
I'd be talking about an insect. But what if I said something
like I found a bug in my code, or this person is
really bugging me? The meaning of bug has
completely changed. And this is what
embeddings can help with. They're a way of turning
unstructured data into numerical representations
that machines can understand. The numbers in these
embeddings capture relationships between
words based on the context. So words that often
appear together will have similar
embedding values. And we're no longer
confined to just text. We can generate embeddings
on multimodal data. So this includes text,
images, audio/video, and even code snippets. All of these can be
represented as vectors in the same semantic space
and same dimensionality. Similar to with text,
all we need to do is submit our input to
a pretrained embedding model or multimodal
embedding model and numerical
vectors are returned. One concrete
example might be say you're generating
product recommendations and you not only want
to take into account the user's purchase
history, but you might also use their browsing history to
look at images that they've seen to find similar aesthetics
that they're interested in. And you might be wondering, OK,
how do I create these embeddings and where do I store them? Don't worry, you don't need
to be an embeddings specialist overnight. There are powerful tools to
help simplify the process. Google offers an array of
different embeddings APIs. In this example, we simply
chose a pretrained Gemini model, feed in our input,
and out comes a list of floating point numbers. While these lists
of numbers called embeddings don't mean
much to our human eye, they mean a lot to
machines and contain a great deal of information.
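As a rough illustration, generating one of these embeddings with the Gemini API might look something like the sketch below. The model name, the environment variable holding the API key, and the sample sentence are all assumptions for illustration, not the exact code we'll use later.

```python
import os
import google.generativeai as genai

# Configure the client with an API key (assumed to live in an env variable).
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Ask a pretrained embedding model to embed a short piece of text.
result = genai.embed_content(
    model="models/text-embedding-004",
    content="I found a bug in my code",
)

embedding = result["embedding"]        # a plain Python list of floats
print(len(embedding), embedding[:5])   # dimensionality and the first few values
```

And next, I'd like to move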
on to vector search, which is our second component of RAG. Our collection of
embeddings represents codebase, internal knowledge,
documentation, and more. But we need to find the
right embeddings quickly. And this is where the
retrieval part of RAG comes in. And this is where
vector search is used. Traditional search relies
on matching exact words. Vector search is
smarter and understands the meaning behind those
words thanks to embeddings. This lets RAG find
relevant information even if your search query
doesn't match exactly what's in your document. One approach to vector
search is to compare all of your embeddings to every
single input in that vector database. This means calculating
the distance between each of the inputs in your database. While this will return
the exact closest match, this probably won't
scale well to millions or even billions of inputs. So brute force is really better
reserved for prototyping or smaller data sets.
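To make that concrete, here is a minimal brute-force sketch using NumPy. The random vectors simply stand in for real embeddings.

```python
import numpy as np

def brute_force_search(query_vec, doc_vecs, k=3):
    """Return indices of the k most similar vectors by cosine similarity."""
    query = query_vec / np.linalg.norm(query_vec)
    docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = docs @ query                 # one similarity score per document
    return np.argsort(scores)[::-1][:k]   # highest scores first

doc_vecs = np.random.rand(1_000, 768)     # stand-in document embeddings
query_vec = np.random.rand(768)           # stand-in query embedding
print(brute_force_search(query_vec, doc_vecs))
```

Every query touches every document here, which is exactly why this approach stops scaling. A popular approach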
is called approximate nearest neighbor search. In this case, embeddings are
indexed to form similar groups or clusters. And when a user performs
a vector search, the embedding input
is first checked against the closest group. It's then compared against all
the embeddings in that group to find the most similar text. This approach is significantly
faster and more scalable, and it still has impressive recall. That is, it returns the truly closest matches the vast majority of the time.
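As a sketch of the clustering idea, here is approximate nearest neighbor search with the open-source FAISS library. FAISS is an assumed stand-in for illustration; at scale you would typically reach for a managed service such as Vertex AI Vector Search instead.

```python
import numpy as np
import faiss

d = 768                                        # embedding dimensionality
doc_vecs = np.random.rand(10_000, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)               # assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, 100)  # 100 clusters ("groups")
index.train(doc_vecs)                          # learn the cluster centroids
index.add(doc_vecs)

index.nprobe = 5                               # how many nearby clusters to scan
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 3)        # approximate top-3 matches
print(ids)
```

The following visualization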
shows the power of multimodal embeddings. You'll see three axes in this
embedding space, color, size, and living. When I plot the embeddings
of the image and text, notice how the image
of the cat in front of the tree and the
corresponding text description occupy a similar
position in space. This isn't a coincidence. Powerful embedding
models have allowed us to represent the meaning
behind text, images, and more. Because of this,
we can now search for images using text queries
or search through video using images, and
so on, and so forth. The convergence of
modalities open up new applications for search
and discovery of your data. Just note that this is
a simplified example for explanation purposes. But in reality, embedding
spaces can be hundreds or even thousands of dimensions that
a 3D space cannot represent. And then to discuss the
last component of RAG, I'll hand the stage
back over to Jeff. JEFF NELSON: Thank you, Shilpa. So we'll now discuss
the last component of RAG, which are large language
models like the Gemini model. So we've talked
about embeddings. We've talked about
vector search and how we retrieve the right information. But now, let's discuss
the A and the G in RAG, how to augment and generate
using these large language models. Now, some models
like Gemini have been trained on huge
amounts of multimodal data and are able to
understand content across those
different modalities. So maybe you've used Gemini
to summarize a video, or give you a description of an
image, or even help you code. That's all multimodal. And after retrieving the
relevant chunks of data from our vector search,
we want to feed them into our large language
model to augment it. The large language model
takes the retrieved knowledge and feeds it into
a larger prompt and then generates a
human readable response. And the outputs are not
just copy/paste versions of the retrieved info,
but instead, the LLM synthesizes the specific facts from the knowledge base and combines them with its
own pretrained knowledge, meaning you get accurate
and up-to-date information that's presented in a way
that's easy to understand.
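As a minimal sketch of that augment-and-generate step, the prompt wording and function name below are illustrative assumptions rather than a fixed recipe.

```python
import google.generativeai as genai

def answer_with_context(question, retrieved_chunks):
    # Stitch the retrieved chunks into the prompt so the model is grounded
    # in our own data rather than only its pretrained knowledge.
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    model = genai.GenerativeModel("gemini-1.5-pro")
    return model.generate_content(prompt).text

# retrieved_chunks would come from the vector search step described above.
```

So let's put all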
of this together. And let's return back
to that first example that we gave where we asked
a large language model about the picture of my car. There, we asked what does
the light on my car mean? An LLM generated
a response for us. And then it gave us an answer. Naive LLMs, as impressive as
they are, have limitations. Their knowledge
is frozen in time. And they can't incorporate
new information without some sort of retraining. But with RAG, we expand this
architecture a little bit. We have these three foundational
aspects that we've talked about. So in addition to our
large language model, we also incorporate
embeddings and vector search. So when a user asks a
question, that question is turned into an embedding. The embedding is then
matched against all of the items in our
vector database. And it looks for other text
chunks, images, code snippets, and more within that
embedding space. The top items are
then retrieved. And we can
actually go back and retrieve the original document
chunks or images. All of this is fed into
the large language model to give us an answer
that's a lot more complete. And the responses are more up to
date with our current business context using data that
might be your first party data that the LLM has
never seen before. So we've discussed
the components of RAG and its general architecture. But this session is also
about multimodal RAG. So let's discuss a couple
of multimodal specific architectures next. And there are a couple
of emergent patterns for multimodal RAG. In the first, we take all of
our multimodal data, images, unstructured text,
code, video, and more, and we summarize it all using
a model like Gemini 1.5 Pro. We summarize it all into text. These summaries are then
turned into embeddings and stored in a vector database. And then when a user
goes to ask a question, retrieval takes place by
comparing the input query to the text embeddings
in our vector database, including summaries of things
like your code or your images. And the output
then could include not only those summaries,
but also the raw images or code itself. This approach is
rather straightforward, but does risk some information
loss because we are generating some text summaries. In the second approach, we
could use multimodal embeddings across the board. That is, we take our multimodal
data, like our text, images, audio, video, and we turn
them all into embeddings in the same embedding space. And we store all of these
items in a vector database. Now for retrieval, we
can input our items in different modalities. We could input text. We could input images. Again, that's turned
into an embedding and matched against all of
our multimodal embeddings. Then we retrieve
the top results. We get the actual
raw assets back. And then we use a
large language model to summarize everything for us. So this approach generally
has higher accuracy. And there's a little bit
lower information loss. But it does require that
we have pretty flexible multimodal embedding
models available for us that can take images and texts
of various sizes and lengths.
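As one illustration, Vertex AI offers a pretrained multimodal embedding model that places an image and a piece of text in the same embedding space. The project ID and file name below are placeholders.

```python
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="my-project", location="us-central1")  # placeholder project

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
embeddings = model.get_embeddings(
    image=Image.load_from_file("cat_in_front_of_tree.png"),  # placeholder image
    contextual_text="a cat sitting in front of a tree",
)

# Both vectors live in the same space, so they can be compared directly.
print(embeddings.image_embedding[:5])
print(embeddings.text_embedding[:5])
```

So in the next section,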
we're going to dive deeper with a live demo. And in the demo, we'll
use the first approach that we see here, which
utilizes text-based embeddings for multimodal retrieval. So Shilpa will take
it away for this demo. SHILPA KANCHARLA: Thanks, Jeff. You gave us an amazing
intro to the Gemini model. Over the next few
minutes, we'll now witness how Gemini 1.5 Pro
can be used to perform RAG. Today, let's imagine
we purchased a new car. It's the 2024 Cymbal
Starlight, a modern vehicle, and we want to ask some
questions about it. For example, what does this
light mean on my dashboard? Is it bad enough that I have to
take it to the shop right now? Or how many miles until I
need to get an oil change? Or what if I broke down on the
freeway, how do I ask for help? Now, if we ask an LLM
any of these questions, we're not going to get really great responses back. For example, I can
ask Gemini 1.5 Pro how many miles can I drive
my 2024 Cymbal Starlight until I need to change my oil? And it responds with a
bunch of information that isn't really relevant to us. And this is because the LLM
wasn't trained on the 2024 Cymbal Starlight manual. Why? Because our team
made up this example. And it doesn't actually exist
in the corpus of training data found in the wild. And this is a
constraint of LLMs. But to solve it, we
can use RAG, which augments the prompt
sent over to an LLM with data retrieved from
an external knowledge base, like this handy owner's manual. Let's see how it works. Before we get too
code heavy, it's important to understand
the preprocessing components of a RAG pipeline. In this pipeline, we begin with
a PDF or a set of documents. We split them into
their component parts, so text is split
into smaller chunks. And images and
tables are extracted. Each of these chunks,
images, and tables will then be turned
into embeddings. Then these embeddings are
stored in a vector database for fast lookup and retrieval. Let's take a look at
how to achieve this by moving over to the notebook. We'll be walking through
a Colab notebook today. And before the session,
we took some time to get connected to
a runtime, installed a number of useful
libraries, and did some setup like defining a
Google Cloud project. So let's skip all of
that for now and dive straight into the demo. We'll begin by downloading
our source data. As mentioned earlier,
this is simply the PDF version of the 2024
Cymbal Starlight manual. Here, we downloaded
the PDF document that contains a mix
of textual information and visual elements
like images and tables. And we can now see in
our Colab environment as well in the left hand panel. As we showed in the architecture
diagram a minute ago, we first take the PDF and split
it into its component parts. In this cell, we extract the
images and tables from the PDF. We also split the text
into smaller chunks based on section titles and
headers found in the source, as well as the content length. Chunking this way also makes it efficient to generate embeddings for each element later. And it improves accuracy in retrieval, because you can retrieve a specific part of your text rather than, say, an entire page or a group of pages.
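As a sketch of what a cell like this is doing, here is how the partitioning can be done with the open-source unstructured library. The library choice, parameters, and file name are assumptions for illustration rather than the exact notebook code.

```python
from unstructured.partition.pdf import partition_pdf

# Split the manual into elements: text chunked by section titles and headers,
# plus extracted images and tables.
raw_elements = partition_pdf(
    filename="cymbal-starlight-2024.pdf",   # placeholder file name
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",           # split at section titles and headers
    max_characters=4000,                    # cap the length of each text chunk
)

# Separate the elements into tables and plain text chunks.
tables = [el for el in raw_elements if "Table" in str(type(el))]
texts = [el for el in raw_elements if "CompositeElement" in str(type(el))]
```

While this is running,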
we can take a quick look at the PDF document as well. So here's our PDF. And if we scroll through, we
can see some visual elements right there. And if we keep going, we
can see a table as well. But as you can see, you can
see how perhaps this PDF might be split up into
different chunks based on the section
titles or section headers. And now once this
is complete, we can see the images extracted
locally in the Colab. Yep, there they are. And we can also see the
sample data as well. Moving on, it's important to categorize the non-textual elements we extract into tables and text. These will be treated differently in our RAG pipeline: we'll use the Gemini model to transcribe the tables into text and to summarize long text chunks. All you need to get
started is an API key in order to get access to
the Gemini 1.5 Pro model. We have a key here. And we kept it in Colab Secrets. Next, we define our
summarization model as Gemini 1.5 Pro. We define a prompt that
instructs the Gemini model to generate concise
summaries of the extracted texts and tables, which will be
optimized for retrieval later on. In this demo, we only summarize
some small tables in the manual. However, you can use
this as a reference if you're dealing with
larger amounts of text, say with a textbook.
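A minimal sketch of that summarization step might look like this. The secret name, prompt wording, and variable names are assumptions for illustration.

```python
from google.colab import userdata
import google.generativeai as genai

# Read the API key from Colab Secrets and use Gemini 1.5 Pro as the summarizer.
genai.configure(api_key=userdata.get("GOOGLE_API_KEY"))
model = genai.GenerativeModel("gemini-1.5-pro")

summary_prompt = (
    "You are summarizing content from a car owner's manual so it can be "
    "retrieved later. Write a concise summary of the following element:\n\n{element}"
)

def summarize(element_text):
    return model.generate_content(summary_prompt.format(element=element_text)).text

# `tables` stands for the table elements extracted in the previous step.
table_summaries = [summarize(str(table)) for table in tables]
```

With that taken care of,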
we can turn our attention to the images that
might reside in the PDF. We've extracted each image
in the manual locally. And we now need to generate
summaries about each of them. We define a prompt similar
to our text summarization prompt just shown a minute ago. But this time, we'll
instruct the Gemini model to describe the content of the
image, focusing on key elements like tables, graphs,
and their meanings. The concise summary
is represented as text and will eventually be used
for embedding purposes. We'll begin by encoding each
image into a base64 string for embedding and transmission. This is a useful step whenever you're passing images into a text-based pipeline for processing. After this, we
generate summaries for each of the images using
the prompt that we previously defined. And we use the Gemini 1.5 Pro
model again to carry this out.
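Sketched out, that image-summarization loop might look roughly like the following. The image directory, file pattern, and prompt text are placeholders.

```python
import base64
import glob
import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-pro")

image_prompt = (
    "Describe the content of this image from a car owner's manual, focusing on "
    "key elements such as warning symbols, tables, or graphs and what they mean."
)

image_b64, image_summaries = [], []
for path in sorted(glob.glob("./figures/*.jpg")):   # placeholder location
    with open(path, "rb") as f:
        image_bytes = f.read()
    # Keep a base64 copy for transmission, and ask Gemini for a text summary.
    image_b64.append(base64.b64encode(image_bytes).decode("utf-8"))
    response = model.generate_content(
        [image_prompt, {"mime_type": "image/jpeg", "data": image_bytes}]
    )
    image_summaries.append(response.text)
```

From here, we have our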
raw image summaries. These will be turned
into embeddings later on for retrieval. Let's take a look
at one of our images and the summary generated for it by Gemini 1.5 Pro. OK, here's the first example. We see this coolant
temperature warning. And you can see a red symbol
against a black background. And as you can see
in the summary, it describes exactly
that, as well as describing what the red symbol looks like. And let's take a look at
our second example, as well. So here, we have a low
tire pressure symbol. And we can see a yellow symbol
against a black background. And that's exactly
what's described as well. Yep, and now that we've
taken a look at that, I'll turn it back
over to Jeff, who will explain different
components like vector store that will come
in the next steps. JEFF NELSON: Awesome. Thank you, Shilpa. So in order to retrieve the
embeddings that Shilpa just spoke about, we actually
need to store them somewhere. And Google offers an array
of different vector databases to do so. Today, we'll be using
Vertex AI Vector Search to house our embeddings. And this cloud-based offering
enables millisecond lookups and uses the same
infrastructure that's used for Google
Search and YouTube. You can see that we input a
couple of different parameters here, one of them being
the dimensionality of our embedding model and
a couple of other items. Next, we define an index endpoint, which is what will let us actually query this index from within the notebook. And lastly, we deploy the index to that endpoint so that we can use it. This takes a couple of minutes to run, so we've run it beforehand.
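For reference, creating and deploying an index like this with the Vertex AI SDK looks roughly like the sketch below. The project, region, display names, and deployed index ID are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

# Create an index sized to our embedding model's dimensionality.
index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="cymbal-starlight-index",
    dimensions=768,
    approximate_neighbors_count=150,
    index_update_method="STREAM_UPDATE",   # lets us stream embeddings in later
)

# Define an endpoint and deploy the index to it so that we can query it.
endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="cymbal-starlight-endpoint",
    public_endpoint_enabled=True,
)
endpoint.deploy_index(index=index, deployed_index_id="cymbal_starlight_deployed")
```

And now, it's time to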
turn all of the chunks that we saw earlier into embeddings. So we'll do this next with a vector store, which we're going to define with LangChain using the index and endpoint that we
created just a minute ago. You see a couple
of other items here around our Google Cloud
project, and importantly, the embedding model. We'll be using the Gemini API embedding model, embedding-001. And we'll stream
all of these updates directly into our vector store. So I'll run that. And next, we use a
multivector retriever. The reason is our vector
store will only contain the embedding and an ID. But we use a document store
that's indexed by that same ID, so that we can access
the original source documents after searching
through our embeddings. In other words, when we do
the retrieval part of our RAG pipeline, we don't
just want to retrieve a bunch of lists of numbers. We also want to retrieve the
original images and the text chunks in order to
feed them into Gemini. And with all of that
setup taken care of, we can now load
our raw documents into the document store. And we can additionally
generate embeddings for all of our text
chunks and image summaries that we created
earlier and stream them into our vector store, again,
Vertex AI vector search. So I'll run that. And we'll see some things
start to populate below. Each of these UUIDs here
represents an embedding that was just streamed
into Vertex AI Vector Search in real time.
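That loading step, sketched with the retriever from a moment ago (variable names are illustrative), pairs each document ID between the two stores:

```python
import uuid

# Text chunks are indexed directly; images are indexed via their summaries,
# while the docstore keeps the raw content so we can hand it back later.
summaries_for_index = [str(t) for t in texts] + image_summaries
raw_for_docstore = [str(t) for t in texts] + image_b64

doc_ids = [str(uuid.uuid4()) for _ in summaries_for_index]
retriever.docstore.mset(list(zip(doc_ids, raw_for_docstore)))
retriever.vectorstore.add_texts(
    texts=summaries_for_index,
    metadatas=[{"doc_id": doc_id} for doc_id in doc_ids],
)
```

So the next step is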
actually accessing all of those in our Q&A pipeline. So just a reminder, to go back
to this slide, our Q&A pipeline, when we ask a question, we turn
that question into an embedding. And then we search across
all of our embeddings in our vector database. We get the matches. We then join these back
to our document store to get the original
text chunks, any summaries, any images. And we feed all of this into
our large language model to give us a great answer. So we're going to begin by
constructing a RAG chain, which is just a sequence of operations
that forms our core application. And it contains a couple
of different components. It first starts by finding
any relevant information from our document input chunks. This includes both
text and images that are identified
using the input query. Next, we use the
image prompt func that instructs Gemini to act
like a helpful automotive expert when it provides its responses. And this preprompt helps the
Gemini model use verbiage around the topic at
hand, auto repair, rather than acting
just as a generalist. This returns better
responses for us as well. And lastly, we also have
the ChatGoogleGenerativeAI class, which allows us to access the Gemini 1.5 Pro model and send through not only our user's query, but also the retrieved context from our vector database. This generates a
coherent response and gives it back to the
user in a readable format.
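Pieced together, the chain looks roughly like the sketch below. The prompt text and the img_prompt_func helper are illustrative stand-ins for what's in the notebook.

```python
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_google_genai import ChatGoogleGenerativeAI

def img_prompt_func(inputs):
    """Fold the question and the retrieved context into a single prompt."""
    question, docs = inputs["question"], inputs["context"]
    context = "\n\n".join(str(doc) for doc in docs)
    text = (
        "You are a helpful automotive expert. Use the context below, which may "
        "include image descriptions, to answer the question.\n\n"
        f"Question: {question}\n\nContext:\n{context}"
    )
    return [HumanMessage(content=text)]

model = ChatGoogleGenerativeAI(model="gemini-1.5-pro")

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | RunnableLambda(img_prompt_func)
    | model
    | StrOutputParser()
)

# Example: rag_chain.invoke("How often should I change the oil?")
```

With all of that set up, I can now define a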
multimodal search function, which is kind of a
wrapper around everything that we've just done. The multimodal search function
takes a text query as an input. It then runs a vector search
to identify texts and image summaries that are
the highest match. The chain then pulls the
raw components, like images or raw text chunks,
and summarizes them using the Gemini 1.5 Pro model. It then prints or displays all of these texts and images (you'll see a lot of print statements here) so that we can be assured we're actually pulling in the right things. The result is a summarization
that's grounded in our truth, or in our input data. OK, so I'm going to go
back to that question that we asked earlier. I'm going to ask
how many miles can I drive until I need to change the
oil on my 2024 Cymbal Starlight? And when I run this, again, the
text query is being run through. And we get four
different chunks back. And these chunks are taken
directly from our input PDF. Those are then all
summarized by Gemini 1.5 Pro. And we get an answer. We should change our oil every
5,000 miles or every six months, whichever comes first. So we're now able
to get an answer that we couldn't get
before because we're grounding our data-- we're grounding our results
in our first party data. Let's ask another question. And I'm going to use
an image this time. And this is a kind of
apt timing because I have an image from my own car. Let's pretend that I drive
the 2024 Cymbal Starlight, and I have my own image
that I took yesterday while driving over here. Let's take a quick look at it. I uploaded it from my desktop. And let's take a look. We see that I am driving in
front of the Google building, but I also have a light on my
dashboard that just showed up. So I'm going to ask Gemini
to act as an assistant that can describe warning
lights on automobiles and give me a description
of that warning light. So we get an answer. There's an amber colored icon. It's shaped like a horseshoe
with an exclamation mark in the center. And I think that's a pretty good
description of the error light that we're seeing
here on my dashboard. Then I'm going to add a
little bit more to this text. So I can say, what does
the following mean? What does this
following light mean? And I'm going to pass that
through the multimodal search function that we
created earlier. Again, we retrieve a
couple of text chunks. And it looks like we
retrieved some matching images from our source
documents as well. Now, the top one doesn't look
like it's totally relevant, but the bottom one
definitely looks like what we intended to retrieve. It looks just like what
was on my dashboard. So I could validate
that looks pretty much like what we'd expect. And then I also
get Gemini 1.5 Pro to summarize an answer for me. It's saying that the
icon on my dashboard is a low tire pressure
warning, a TPMS warning light, which means that one
of my tires is under-inflated. I should check
the tire pressures and probably inflate
those tires to get rid of that error message. So this is cool, I
can now look at-- I can input an image. I can input some text. And I can retrieve both text
summaries and the raw images themselves. So I can validate that I'm
getting the right answers back. Now, I say, OK, I know
my tire pressure is low. But what should the tire
pressure be on this automobile? So I'll just ask that
follow-up question next. Again, we retrieve a number
of different text chunks. And we receive an
answer that's summarized from our grounding
data that says I should pump up my
front tires to 35 PSI and my rear tires to 38. So I should probably do that
over the next couple of days. So I can clear that error
message from my dashboard. We also get a little bit
of additional information from our owner's
manual that tells me where I can find that
tire pressure info if I were so inclined to do so. But it's pretty
convenient to just be able to ask these
questions from my RAG pipeline instead of having
to scroll through or look through hundreds of
pages of an owner's manual to get that answer. RAG was able to generate this
answer using our first party data and even show an
image from our manual. So I know it's the
right warning light. And that's the end
of the demo here. So let's move back
over to the slides. So just to summarize what
we spoke about today, RAG is a powerful
architecture that combines information
retrieval techniques with generative
large language models and enhances the quality and
relevance of our generated text. The main idea is to provide
the generative LLM with access to external knowledge
sources, like your first party data or your enterprise
data, and allow it to retrieve the
relevant information and incorporate that into
the generation process. And although today's
examples all related to cars, we're not actually confined
to any one single industry. So we could talk about maybe
the technology industry, where multimodal RAG could help
accelerate complex legacy codebase migrations. We could analyze code,
architecture diagrams, any chats between developers,
any documentation. And we could suggest any
migration strategies. We could identify similar
modern code samples and really speed up that time
to value for the migration. In the retail sector,
a customer might have-- a customer might ask a
question like, where can I get a shirt like this? Or where can I get pants
in a different color? They might even upload a
pair of pants or some shoes and say, where can I
get something like this? Do you have this in stock? Again, RAG can analyze the
question that's from text. It can analyze the image. It can analyze the video, and
check against your product catalog, and craft a direct
answer and summary back to the user. Not only that, but it
could include product links and maybe other
recommended products that the user might
be interested in. In the media and
entertainment sector, a user might be looking for
a movie on a Friday night. And they might want to input
a combination of text, images, or audio to get the type
of mood of the movie that they're looking for. So I might say I want some rainy
day vibes because it's raining. I might upload a picture
of a cozy fireplace. And I might also say, hey, I
love this mellow jazz track. Multimodal RAG could
take all of these inputs, synthesize them, and
extract the emotional tone, the visual aesthetics,
the audio cues, and provide me as the user with
the right movie for that movie night. So this slide contains a
list of how to get started. The Gemini API landing
page is the first link. And it's the quickest way
to get started using Gemini. All you need is an
account and an API key. We also have Vertex AI, which
is hosted by Google Cloud and now provides over 150
foundation models, including Gemini 1.5 Pro, allowing you to
run your workloads at enterprise scale. And we also include a couple
of links to code samples. First, to the Gemini
API repository, which includes references to a
number of different notebooks and also to our Cloud
Vertex AI GitHub repo, which contains dozens of
notebooks from how to get started to things like RAG
and agent building and more. So thank you so much
for joining us today to explore multimodal RAG. We hope you found
this session valuable and it can inspire you to
build something on your own. [MUSIC PLAYING]