- Hello, welcome, everybody,
thank you for joining us for this session on generative AI. I know that there are no other sessions about generative AI at
re:Invent this year, so I'm grateful for you
joining us in this one. So in this session, we're
talking about architectures and applications in depth. Hello, my name's Mike Chambers,
I'm a developer advocate for Amazon Web Services and
pretty much specializing in generative AI. - And I'm Tiffany Souterre,
also a developer advocate at AWS, specializing in AI as well. - Awesome. Should we get into the agenda of what we're gonna cover today? - Sorry.
- Oh no, that was, yeah. We've both got clickers,
what could possibly go wrong? - I just, oh dear, sorry. So for the agenda today,
we have a generative AI, a quick recap so hopefully,
you have some understanding of what generative AI is. Then we will go through the question like what is retrieval
augmented generation, RAGs? What are agents? We'll go through security
audits and compliance for your applications with generative AI and a little bit of
predictions for the future of generative AI in the tech world, yeah. - It's predictions where we hallucinate, probably-
- Yeah, exactly. - All right. - So we have a poll question for you guys if you want to flash the QR code. The question is, are
you using generative AI in your workloads or workflow today? Are you using generative AI? - I am, strangely enough, yes, because I put presentations
like this together, but yeah, absolutely. I think generative AI has
had a huge impact on my life and the life of my family as well. So yeah, using CodeWhisperer, for example, to help me write code,
what about yourself? - Yeah, well, I mean,
even in the workflow, like maybe you guys also use, oh- - We're in the wrong room. - No? Why are you even here? - No, well, we're here to
talk about it, I guess, right? - Or maybe trying to understand why- - Look, and we've got some
more questions coming up, which I think sort of going,
may sort of reveal this in a little bit more depth. I don't know if you find this interesting, but I think that polling
was kind of interesting. And also, we welcome the other
rooms that are joining us as well via the stream 'cause
other people up and down the Strip are watching this. So yeah, okay. So let's dig into this in
a little bit more detail. So I guess we start off
here, we wanted to start off with a quick recap of generative AI. I think we're making the
assumption at the moment in this session that everybody
has some kind of familiarity with generative AI. Let's go for the really
simple quick start. So generative AI, it's machine learning, it's artificial intelligence
that generates content. A lot of the machine learning
that we've dealt with back in the old days, like last year, was all about making
predictions about what data was, so here's a picture of a dog,
here's a picture of a cat, great stuff like that. But in this case, we're able
to type in prompts to models that can do these generative tasks. And I'm sure we've all had a
go with putting in a prompt and asking for a picture to be generated. A golden retriever wearing glasses, a hat in a portrait
painting, I did type that and it did make that, and
also going to chatbots and typing in prompts
and getting answers back, so in this case, this is
one of the open source large language models asking
it, what is generative AI? It came back with its answer there. So generative technology, maybe
we're all on the same page and we've all had a go with this stuff. So at the center of all of
this is the foundation model. Certainly, at the center of the, you know, the applications that we build as builders and developers on top of this technology,
there is a foundation model. And these foundation models
are machine learning models that have been trained with
a transformer architecture. We talked a little bit more about this in a 200-level session that we
had just before this one, this morning. So we're not doing it again,
but it'll be up online at some point, so BOA 210
if you want to have a look at that once it's up online. But the transformer architecture was really the pivotal moment
when using this technology, we're able to train really big models, so we're able to take data that we have. And a lot of the time
in these presentations, we're probably gonna end up speaking about large language models, LLMs, because that's really where,
at the moment, the majority of the enterprise and business
and social and real value is, it's great that I can go into a prompt and get a picture of a golden
retriever wearing a hat, but there is so many more things that we can do with this technology, large language models can do those things. So we put all of our data and
train our foundation models with compute, I've said GPU
here, this could be other kinds of PUs, but we're talking about
the types of architectures which are optimized for training transformer architectures. And we may be aware of this,
we're probably familiar with this, these are not small things, these are not things, this
isn't a rainy weekend project, which is, do you know what,
I'm gonna make myself an LLM. That's not really how these
work, not the big ones anyway. And so there is a lot of data and a lot of compute power required
to generate these models. And this has really been one of the transformative
things and why we're talking about generative AI now as
opposed to a couple of years ago, far less of us were
talking about it anyway. And it's because of this
gradual change in size, this growth in models, it's
almost not worth trying to keep this slide up to date. The point of it is that models
used to be really small, like back in the old days and
very very old days, we dealt with tiny little models
with like a few hundred parameters, then some
millions of parameters. But that line on the graph,
that's, in the 2016-2017 frame, that 2017 point is when
transformers emerged and when we discovered this ability to make much larger models. And that's really been the key to this: as the models have got bigger, their emergent
properties have been discovered, like we've made these big
models and said, oh my goodness, they can do all of these things. And so we are finding that models are, not exponentially in the strict mathematical sense, but in the everyday sense, exponentially growing and getting much, much
bigger, what's gonna happen in the future is a bit of a
guessing game for everybody. In reality, what's happening is that this is fanning out, right? So models are getting much, much bigger, and there is a push to get
more and more capabilities from the models by
making them much bigger. But they're also finding that
we can train smaller models which do other tasks quite well. So I know this isn't, you
know, there are new versions of Llama 2 and Falcon since
this was put together, but there are those smaller models too. So we're getting a lot of capability, a lot of spread of models. Generally, especially
compared to the old days, they're all much, much bigger
than we've dealt with before. Now, I said at the beginning,
we're talking largely about large language models
here, large language models, they are a type, of course,
of foundation model, but they're text-based,
so they're the ones that we're gonna be
using for our chatbots and for making text completions. But there's a lot, lot more underlying this, and I think this is one of the
things that, for me anyway, it just absolutely blows me away, I mean, even when we were talking
about this this morning, it still just amazes me
that this technology works. But really, all these things are doing is making next word prediction. Next token prediction, I guess
we would say, we understand that they work with tokens, not words, but that's all they're doing. We're sending these
prompts in and the models are figuring out what
the next word might be and then cycling through that
again and again and again, we get some kind of output we can use. And that is the root of all of these amazing emergent
properties that we've found. So we enter a prompt and
the LLM adds a completion. But it's such a simple concept, very complicated from a
mathematical perspective, but just by predicting
the next word, we're able to create all of this quite
astonishing capability.
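(If you want to picture that loop in code: a toy sketch, where predict_next_token stands in for the whole model. It isn't a real model, it just shows the shape of the idea.)

def generate(prompt_tokens, predict_next_token, max_new_tokens=20):
    # Toy sketch of next-token prediction: the model only ever picks
    # the next token, and we feed its own output back in, again and again.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # the model's single job
        tokens.append(next_token)                # append and go around again
    return tokens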
So the large language models that we produce, one of the things which sort
of sets them apart from models that we've worked with in the past, again, apart from size alone,
is that in the past, with natural language processing, we would create models
to do specific jobs. So we would create models
to do text generation, we would create models
to do text summarization, et cetera, et cetera. Translation, I love translation,
like we could translate from English to French,
but you can also translate from natural language
to a computer language, so from natural language to Python, which I find super useful and maybe some of you have used that,
sentiment analysis, et cetera. What we're finding is that we can use the large language models
as general-purpose models, they can do all of these
things and a lot more. We say more, dot dot
dot, that's important, we'll be coming back to that. But they can do all of those things, we don't necessarily have to
take a single-purpose model. And we'll see what the future brings, there will be specialized
models, et cetera, but these large language models
are super, super capable. We're not gonna throw quiz
questions at you, well, they're not quiz questions,
or survey questions at you the entire time that we're on stage here. But we've got another
question for you here today. And I'm super interested in this one. Of all the questions I've
probably asked any audience, I'm very interested in this
one, but what is stopping you from building with generative AI today? There are some options on there that aren't gonna come up yet, they'll come up on the screen when we get the answers.
- Do you have a guess? - I have an idea, I don't know. I don't wanna predict anything. What's stopping you from
working with generative AI? Not much, we are working with it. - We are. If we could not work with generative AI, we wouldn't be working at all. - That's right, that's true, yes. - But-
- We'll take some time. - I guess we're-
- All right. Ah, interesting, oh,
interesting, oh, it's changing and it's live, this is good. - Privacy and security. - Privacy and security. Okay. There's more votes coming in. So all right, there's a
whole cohort of people that there's nothing
stopping you and I guess, ah, the devil's in the detail of this, we'd just like to have so
much more interactivity. There's nothing stopping
you and you are doing it, or there's nothing stopping you but still, you are not doing it, I don't know. But privacy and security, that's cool. We'll definitely be talking
about that in this session. - Yeah. And so for the next
slide, so one of the things, and some of you touched on this in your answers, that could prevent you from using foundation models is hallucination. And that's because some of you
might have experienced that with LLMs, but sometimes,
you ask a question, and because it really wants
to give you an answer, sometimes it gives you
something that is not accurate, or that sounds like a probable answer
but is not factually right. And this is something that, if
you use LLMs in your use case for your business, you
don't want some of the answers to be wrong. You want this to be accurate all the time. And so you need more
control over your LLMs, you need to customize them
for, they have to be tailored to your business, to make sure that the answers the AI is giving you are accurate. And also, there's the
other limitation, which is that once you've trained
your foundation models, their knowledge is limited to the point in time when you trained them. So let's say you have trained
your LLM back in 2021, then it will only know- - For example.
- For example. - For example, I don't know
where you got that date from. - Yeah. Well, it will only know the
things that you've given it as a dataset for training
up until that date. But everything that has
happened after that, it would not know factually. I mean, it would be
able to give you answers that are probabilistically right, but the facts are not there. - Yeah, I think a really
interesting way to put this is that large language
models at that point, they have a good sense of the world. So they know that up is
up, down is down, the sky is blue, but they
don't know about you, your business, your social enterprise, whatever it is you're doing,
they don't have those facts. - Exactly. So one of the things that you could do to gain more control over your AI is using retrieval augmented
generation and agents. We're gonna see in a minute
what this actually means. So we're going to show
you the general idea of what a retrieval
augmented generation is. So basically, you have your
LLM and you give it a query. So it is your prompt, this
is the question you ask your LLM, and what you wanna do is give it more data. You want to inject data that will help your LLM better understand your question and make sure that it has all the facts so that the answer is accurate. And the source for the data
could be different things, it could be a vector
database, it could be an API that helps the agent
to retrieve information and then put this in the prompt. So basically, it's not very complicated, what happens here is that
you just augment your prompt with your initial query plus the data that is relevant to the question. - And I think a real good call-out here is this is what we're not doing. So the first thing that,
we're not going immediately to the step of fine-tuning,
we're not immediately going to creating our own foundation model or doing continuation training. All the things which
you absolutely can do, but the first thing
that you should look at is prompting, prompt
engineering, and this, retrieval augmented
generation, which is not as complicated as I think
most people make out. - Yeah, exactly, 'cause you
might have heard of fine-tuning, but this requires a lot
of compute like power and you need data and it
takes time and it's costly. When you do retrieval
augmented generation, you only need to update the data, like the source of the data
that you send to your LLM, and this is why your data is fresh. So now your foundation model
has been trained up until 2021. But then in the data, in the
vector database or the API, you give it more information
that is up-to-date and this is how it's able to generate a more accurate answer. So the mental model. We have, so we know that those concepts can be very complicated,
and we like to use metaphors to kind of wrap your head around what RAG is. So an analogy that we found
with Mike very useful is- - [Mike] Don't blame me. - No, no, is the wizard analogy. So basically, what a
foundation model is here is the wizard student. So at the beginning, you
have a wizard student, and it doesn't know exactly how to cast spells, it doesn't know much about magic, and what you can do is to
send the wizard to school. So they have pre-training. So this is where they learn
about the world of magic, like general knowledge about
everything that is magical. But it's unable to cast specific spells. Like right after school, if
you want to specialize in a, let's say a domain of magic- - [Mike] There's a troll in the dungeon. - Yeah, exactly-
- We all know what it is- - For example, you would not
be able to kill the troll- - I don't know how to do it. - So what you would need at that moment is probably to have a
book of spells with you. And that book of spells is the
data that you need to be able to basically kill the
troll, like, that is with you, or answer a question, or also specialize in a certain area of
magic, like we say before, like maybe you want to
specialize in transformation, like you want to transform into an animal or you want to specialize in, what's the other areas of magic like- - I don't know, I think
you're digging a hole. - Right. So, but that basically
what it is, so the student is your LLM and the spells is your data, your vector database. And, oh yeah, that's the fine-tuning part. - Yeah. So what might happen if the wizard goes and looks at the spells
and doesn't understand the terminology of those new spells? - Then you can send the student back to school to refine the training, like to refine the
general knowledge of it. And that's basically how
you can customize your LLM. And now, going back to a
little bit more code, 'cause we know it's a 300-level session, we're going to show you how this works. - [Mike] The wizard's
coming back, by the way, so if you wanna take pictures and say, I was at a 300-level talk and
they were showing me pictures of wizards, you'll have
your opportunity to do that. - [Tiffany] Of accurate architecture. - [Mike] Yes. So, and somewhat it's in
this demonstration too. So what I've got here is some code. When we've been talking with people and exploring retrieval
augmented generation, there has been some confusion, I'll be completely clear with you, about the relationship between, especially when we're talking
about vector databases and vectorized information and the actual large
language model itself. And in this demonstration,
I'm gonna show you some code. It's not necessarily the
prettiest code in the world, and I'm not suggesting you'd
use this in production either, by the way, but I'm going
to step through an example of retrieval augmented generation
so we can actually see it. Because very often, you'll
be implementing this with a library such as
LangChain or with products and services that do all the
undifferentiated heavy lifting for you. This is a very hands-on, no libraries, kind of some libraries, but
no real libraries at work. So we can see exactly how it's working and how the large language model and the vector database are separate. - [Tiffany] Yeah, just to make sure, don't do this in production. This is just to explain
to you the concepts, but then you have tools to, you know, manage everything for you. So this is a very hands-on
example of it, but don't do this. - [Mike] They said don't do this- - [Tiffany] Don't do this in production. - [Mike] What's the point then? Okay, so it's an example, and I'm wanting to illustrate
the point in Python. So at the beginning of this
notebook, so I've got a notebook of code, at the beginning
of this notebook, I'm importing some libraries, and we'll obviously step through and see where these
libraries are being used. The key thing here is that
we're using Faiss, F-A-I-S-S, the open source vector database
in memory from Facebook, so, or Meta, that's just being
used just for convenience sake. It's used a lot in examples
for this type of thing. And then we're using
Boto3, the SDK for AWS, and Jinja to do some templating. So the first thing that I'm going to do is I'm going to create for
myself a Bedrock client. So I'm actually using Amazon
Bedrock behind the scenes to do this. So I'm going to be
using a couple of models for Amazon Bedrock, the embeddings model and then one of the large
language models as well. But you could use any embeddings model, any large language model,
and experiment around in a similar kind of way. But getting myself this client means I can move forward.
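(For anyone following along at home, a rough sketch of that first step with Boto3. The Region here is just an assumption for illustration, use whichever Region you have Bedrock enabled in.)

import boto3

# Sketch: create the Bedrock runtime client used in the rest of the demo.
bedrock_runtime_client = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",  # assumed Region, pick your own
)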
So the next thing I'm going to do is load up my documents. And so I quickly last night
thought, do you know what, I feel like we should
make this about spells. So we've got a whole bunch of spells here. There's a special secret
one in there somewhere if you can find it, we'll
find it in a moment. So these are basically just text strings, I know these represent
documents, essentially. So the documents that you might
have in your document store. Small, simple, something that
we can just run here today. Remind me as I scroll through
this to run these cells because otherwise, we will have a bad day. Okay, so I've run everything
so far, that's cool. So the next thing we're going to do is we're going to vectorize
all of that information. So what that does for us is it takes all of those text documents,
spells in this case, and it's going to convert
them into a vector space. So we use an embeddings model to do that, and that basically takes the
text string and converts it into an embedding space of
1000 and something vectors. And it would do that,
if it said hello world, it would be that big, if it
was an entire 10-page article, it would also be that big. So there's nuance and careful crafting when you're doing this in
production about how large a text string you wanna
put into the vector. But for now, we're gonna
put the entire spell in. And so this is just a simple
function that I've got, which is gonna help me do that. So it's going to take the text, the spell in this case,
and we're going to run it through the Titan embeddings model. So these are just some
keyword arguments here, which help me to call the
model from Amazon Bedrock. So Amazon Bedrock has a
fairly standard interface, an SDK-level interface. And so this construct here
will look pretty similar for each time we're calling the model, whether we're generating an
image with Stable Diffusion with my dog and a hat and whatever, whether we're doing text generation, or whether we're doing this embedding. And this line here where
we just call the client that we got, so bedrock_runtime_client, and call invoke_model, this
line is exactly the same for any of those
generations that we're doing or any of the embeddings that we're doing, just passing in those
keyword arguments there. So we've got that set up,
we've got our function set up, ready to go, and it's just
going to essentially return for us the embeddings,
just extracting that out. And so from this section
here, which is very long, lots of Enters there, we can now go and do that for all of our spells. So all I'm basically doing
here is creating a NumPy array where I grab all of those
embeddings that will be generated and put them into that NumPy
array, so let's run that. And any suggestions for code improvements, please feel free to let me
know, not now, don't shout out. So that took a little while to run, you probably noticed that. And that's because it
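(Roughly what the loading and embedding steps might look like, building on the client above. The spell strings here are made up, and the Titan model ID and response shape reflect my understanding of the API, so treat them as assumptions and check the Bedrock docs. We also drop the vectors straight into a little Faiss index, the magic bookshelf you'll hear about in a moment.)

import json

import faiss
import numpy as np

# Made-up stand-ins for the list of spell documents in the notebook.
spells = [
    "To become a fish, puff out your cheeks and say bloop bloop bloop, I'm a fish.",
    "To light a candle, snap your fingers twice and whisper brightly.",
    "To turn into a cat, purr three times at midnight.",
    "To summon rain, hum quietly while looking at the clouds.",
]

def get_embedding(text):
    # Call the Titan embeddings model through the Bedrock runtime client.
    kwargs = {
        "modelId": "amazon.titan-embed-text-v1",  # assumed embeddings model ID
        "contentType": "application/json",
        "accept": "application/json",
        "body": json.dumps({"inputText": text}),
    }
    response = bedrock_runtime_client.invoke_model(**kwargs)
    return json.loads(response["body"].read())["embedding"]

# One embedding (a long list of numbers) per spell, as float32 for Faiss.
spell_embeddings = np.array([get_embedding(s) for s in spells], dtype="float32")

# The "magic bookshelf": an in-memory index doing exact L2 distance search,
# which is plenty for a couple of dozen vectors.
dimension = spell_embeddings.shape[1]          # length of one embedding vector
magic_bookshelf = faiss.IndexFlatL2(dimension)
magic_bookshelf.add(spell_embeddings)          # add all the spell vectors
print(magic_bookshelf.ntotal)                  # how many vectors are stored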
was running through each of those spells, sending
it off to the model, getting the embeddings, bringing it back. So creating embeddings
is not gonna be instant, but it's gonna be pretty fast and it's gonna be much faster than trying to fine-tune a model or do
continuation training on a model using especially that
limited size of a dataset. So we've got that done now, so I should have my spell
embeddings all set up. Let's do this, let's risk
everything by writing more code. Let's just literally just print out the spell embeddings to the
thing, so yeah, you can see. So numeric data, the embedding space for those different spells. It's a one-way process, we can
convert it into embeddings, we can't bring it back again. So we are using this or we will use this to find the data from our database. The original documents essentially need to be stored somewhere,
which they are, of course, further up in the notebook,
we have the original list. We still need to have that
because we're gonna use this as an index and we are
gonna get our data back from the original list. So we've got our embeddings,
what I need to do is I need to go and put that inside of the actual vector database itself. So by doing that, I
have the capability then to be able to perform queries on that. So this is a simple line, we're gonna create a magic bookshelf. I haven't shown you all
of this code, have I? This is called the magic
bookshelf and this is the index, essentially, you would call this index if you're writing proper code. With that index now,
we've gone and got that. And I can just run this, oh, sorry, that's creating the index now, we're gonna populate the
index with all of our data, and just outputting the length of that. So we can see we've got
21 spells, documents, pieces of text, vectors,
inside of our vector database. All right, so up until this point, we've done nothing with
large language models. We've only dealt with vector,
sorry, with embeddings models, and we're gonna carry on
like that just for a moment. So we want to ask a magical question, and I know I'm wanting to know this, but how can I become a fish
seemed like a good idea at the time. So how can I become a fish,
that's the question I'm going to ask, and I know
that the answer's in there. And obviously, in a larger
system, you'd be able to ask more nuanced and
interesting questions than that. But how can I become a fish? The first thing that we need
to do with that question then is we need to turn that question
itself into some embeddings because what we want to do is
we want to get the database to query and do a similarity search to find all of the vectors that it's got, which are all the representations
of the spells we've got, and find the ones which are closest, like in Euclidean distance, closest to the vector of our question. So I'm gonna embed my
question, and just for the sake of being able to see what that looks like, let's just take a quick look at that. So there it is, there's my
embedded question, very long. It's 1000 and something places
in size. How big that vector is depends
on the embeddings model that's being used. So I've got that now, and so now I can go and query my index. So I'm going to say k of four, so I'm looking for the
four nearest facts, spells, documents, whatever they are,
the four nearest snippets of thing that are closest
to the question that I have. So I do that here, so I have k as four, I have my embedded query in there, and I'm searching this index that I have. So I can just press run on
that and it's pretty quick, it's in memory, it's
obviously also very small, this isn't enterprise size. And I get a couple of things back. So I get this first
list of integer values. So these are the index
positions of the facts, the spells, inside of our vector database which most closely correlate to the question. Now, clearly, I'm setting
myself up for success, so I am asking something
specific about a specific spell we've got, so I am expecting 11, the thing at index
position 11, to be correct, printing out here as well the distances. So the distances between
the vector spaces. So inside of the vector
database, we've got a big cloud of vectors, which are our spells, and we've got our query,
it sits in the middle, what are the distances
to the four closest, and so those are what
those values are there. So we could do something
more sophisticated with those if we wanted to.
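(The query side of that, sketched out with the same pieces as above: embed the question with the same embeddings model, then ask the index for the k nearest vectors. What comes back is index positions into the original list, plus the distances.)

# Embed the question and find the four nearest spells.
question = "How can I become a fish?"
embedded_question = np.array([get_embedding(question)], dtype="float32")

k = 4
distances, indices = magic_bookshelf.search(embedded_question, k)
print(indices)    # index positions into the spells list, nearest first
print(distances)  # the matching L2 distances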
Okay, so that's basically it, obviously, there's more, but that's basically it,
we've now performed a query of our vector database, we've
embedded all of our documents, and we've performed a query,
we've got some answers back, something which is like
the most likely answers are clustered there. No large language models have been used at this point at all. We've used an embeddings
model, but that's all. Where does a large language model come in? Well, it only comes in if you want it to in your application, so we do
want to, so let's carry on. So we're going to now construct a prompt and we're going to prompt
our large language model to help us answer the question. And so I'm using Jinja here
as a templating language, a templating model, to be
able to put together a prompt which contains our information. There's a really crucial
part in here as we run this. So let's run this, first of all, so we've now loaded the
string, that's basically it. And I'm saying, you know,
given the spells provided in the spells tags, find the
answer to the question written in the question tags, there's
lots of different kinds of prompt engineering
kind of best practices. You can read a lot from the
different model providers about what works really well. This is sort of taking play from Anthropic and the way that you work with Claude, but we're actually gonna end up using the Amazon Titan model. And you can see in here
with Jinja templating, we're gonna put all the spells in here and we've got the question
will be put in here. Let's go and fill that out though, so I'm literally just using Jinja, if you're not familiar with that, it's essentially a
templating library mechanism that you can use quite easily in Python. And it's basically just
gonna smush the data together with our template.
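(A sketch of that templating step with Jinja2. The wording of the template is illustrative rather than the exact prompt from the notebook; the point is that the retrieved spells and the question end up together in one plain text string.)

from jinja2 import Template

template = Template(
    "Given the spells provided in the <spells> tags, find the answer to the "
    "question written in the <question> tags.\n\n"
    "<spells>\n{% for spell in spells %}{{ spell }}\n{% endfor %}</spells>\n\n"
    "<question>{{ question }}</question>"
)

# Pull the actual text of the nearest spells back out of the original list...
retrieved_spells = [spells[i] for i in indices[0]]
# ...and render the final prompt: plain text, no embeddings anywhere in it.
prompt = template.render(spells=retrieved_spells, question=question)
print(prompt)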
So now let's go and print out the actual prompt so we can see what it
is, let's go and do that. I think I might need to
just put that into there. Let's run that. And so this is now my prompt, text prompt, that I'm gonna send to
my large language model. Key, key concept here,
the spells, the facts, the things have been
entered into this as text. We are no longer talking about embeddings. So the embeddings model that
was used to vectorize our data was only there to help us
do that similarity search. And then we took the text that we got back from that similarity
search and we've put it as text into this. The large language model, yeah, the large language model has its own embeddings. - [Tiffany] Oh, well, just to
make sure that this is clear. Before having a RAG, we
would just send the question like how can I transform
into a fish to the LLM. Now everything that
Mike has just shown you is how he recovered relevant data in the vector database
to augment this prompt. And this is like prompt
engineering, basically. He is putting more
information into the query to the LLM to make the LLM able to answer the question accurately. - Absolutely, it's what, again, on that slide that you looked at before. And so, yeah, so now we're
going to go and go ahead and put this into the
large language model. The large language model
also has embeddings built into it inside of the
multi-headed self-attention and all that kind of other
stuff completely separate from the embeddings that's
used and the embeddings model that's used to get the
data into the database. So we have our prompt,
it's just a text string, it's the secret of prompt engineering, it's just text strings. And we're going to now
send that into the model. So this is now something more
specific to Amazon Bedrock. I've got my keyword arguments that I'm gonna set up again here. So in a similar way that I did when I called the embeddings model, this is now how I call
this particular model, so this is the Titan Text Express v1. So a model that I think became generally available yesterday. So it's now available in Amazon Bedrock. You can enable the model access, if you're not sure how to do that, come see me afterwards and
I can help you do that. And we've got all our keyword
arguments that we can send in with max tokens and
all that kind of stuff. Interestingly here, I've turned the temperature down to zero. So if you're familiar with temperature, it's about how creative
you want the answer to be. And in this case, because we're
dealing with specific data, which actually is very
likely to be in the prompt, we don't need it to be very creative. We want it to be quite factual. We wanna avoid hallucination if we can. And I'm not gonna tell
you that this architecture can eliminate hallucination,
but it will definitely help move away from it. So we've got, I think I ran that already, but let's run that, so we've got our keyword arguments
So we've got, I think I ran that already, but let's run that, so we've got our keyword arguments there, and I can just run it through this section of code here,
response from Python if you're familiar with
using the Boto3 SDK. So we can basically call again this line, exactly the same line from
before, if you remember, to invoke model, we're gonna
get back our body response, and we're gonna load out of
that the generation, press play. Pray to the demo gods.
Yes, I'm running this live. Yes, they said you should just
record a video of doing it just in case it doesn't
work, but I have faith. And you obviously have faith too. So to become a fish,
are you ready for this? To become a fish, you need
to puff out your cheeks and say bloop bloop bloop I'm a fish. Thanks. I was hoping you'd do that. I have another question
which I'm gonna ask as well. So here's the, ooh, so the special spell that we had in the middle of here as well. Let's go and grab this one. How can I get started with
any AWS service quickly? So if I just scroll back up
to this just to do one more, where does it sit? Here is our ask-a-magical-question cell, asking about the magic of AWS. So I'm gonna set my
question, I'm just going to run these cells again,
Shift + Enter all the way down. Here is my new set of potential answers, with the one at index position number 10
looking like it's likely. Let's redo our prompt. And maybe we'll skip past that, otherwise, you see the answer, it's very easy, as I said, this is a very simple demo, but I really just wanna
highlight how all this works. And if I wanted to do that, to get started with AWS's services quickly,
I open up the console and use Amazon Q. Have you had a go with
Amazon Q? A little bit? Is anyone there? Okay, a little bit. Have a go with Amazon
Q, Adam talked about it in the keynote yesterday. Okay, so thanks for watching that, this basically was to highlight hopefully underneath the surface how you can work and how retrieval
augmented generation works. And I think that's really important. There are different ways that
you can do this very easily, but that's what's happening
under the surface. And I think having that
intuitive mental model will really help to be
able to debug applications that you're writing if you
write them in other ways. - Yeah. So seeing all that helps you
understand what's happening. But now, oh yeah, I'm going
to talk about agents before- - [Mike] Yes, yes, yes, yes. - Yeah, okay, and so now with our agents. - [Mike] It's back. - To have the, finally the
technical architecture complete. The agents are basically, they can do actions for
you, if you give like APIs to agents, they will be able
to perform tasks for you, so it's not just like you
have your foundation model and just ask it a question. Now, suddenly, with agents, you are giving your foundation
models arms and hands. So basically, they can, for
example, if you ask them for an item to buy on a
retail site, like, whatever retail company you can think of, and then at the end, you ask like, search for this item
for me on this website and then you can basically
ask, buy it for me. And then the agent will be
able to do the action for you. And depending on whatever
action you want it to do, like you said a good example earlier, you said retrieve the time. - [Mike] Yes, super, super simple one. - Super simple one but very useful, if you want your LLM to be
able to retrieve the time, then an agent would be the way to do it 'cause then you would
give the API to the agent and then the agent would
do the action for the LLM. - Yeah, ask an LLM what the time is, some of them will tell you what
the time is, but of course, they don't actually know. - And maybe it needs
time to do other stuff, like other tasks after
that, so this is a way to make your LLMs more capable, to enhance their capabilities. So basically, going back to a more- - Bye-bye, wizard. - Yeah, to a more serious architecture, the wizard is the foundation model. And in your example on
the code, it was Titan. The vector database is, so it
replaces the book of spells, but in your example, it was
the list of spells, yeah, that you vectorize and
put in a vector database. And then we have the agents and basically, oh, yeah, the pre-training, fine tuning, so that could be another model. But then basically, how those
pieces interact together is that you send a query
to an agent, then the agent will interrogate the foundation
model and ask basically, do you need more information
to answer that question? If the foundation model doesn't
have enough information, it will say, yes, retrieve
some information for me. The agent will then retrieve
data from the vector database and then, as we showed with the code, enhance the prompt with the query and the necessary data
to answer the question. And now we can finally
talk about Amazon Bedrock because before that, it was
all just, you know, theory, like we were talking about
general architecture. But what Amazon Bedrock does for you is that it is a fully managed solution that creates everything
for you behind the scenes, so you don't need to vectorize your data, you don't need to create
a vector database, you don't need to create agents, everything is taken care of for you. So basically how it shows
here in the console is you go to Amazon Bedrock, then you
can find your foundation model, so here, you can choose
between Jurassic, Titan, Claude, Command, Llama,
and Stable Diffusion. If you want to know exactly
what each foundation model does, then you have a description
of what they're good at doing. Basically, if you wanna
use a multi-language model because your use case is to
speak with different languages and answer in different languages, then you would use Jurassic. If you just wanna do text generation, then probably Claude is
the best option for you. If you wanna generate
images, then Stable Diffusion is your pick then,
depending on your use case, this is how you will choose
your foundation model. Then here, we have the knowledge base. So basically, this is
where you put your data. What it is, in practice, is that whatever data you have,
it could be HTML pages, it could be text, it could
be JSON, whatever you have that you want to use to enhance
the knowledge of your LLM, you put it in an S3 bucket
and then you give the URL to your S3 bucket here. And behind the scene, Bedrock is going to do the vectorization, the embeddings, and put everything in a
vector database for you. And then you add the agents if you want to give more capability to your
LLMs to perform other tasks, then here are all the
APIs that you can give it so your agents can
perform the task for you. - [Mike] Awesome stuff.
- Yeah. - So now put my pocket
protector in and my tie on, and we'll talk about security,
audit, and compliance. And of course, that was something
that was super important to you when we talked to you earlier on and we did that poll before. And so I'm gonna jump into
another architecture diagram at this point, and it
sort of covers sort of what we did before, but there's
no wizards this time, sorry. So we have on the left-hand side the apps that we are generating. And I just, I'll pick up
on this point as well, we just talked there about Amazon Bedrock and how it can do all of those things. If you've already been
experimenting around prior to the release yesterday of
those services in the keynote, we also have integration
with things like LangChain. So if you've been using the
open source LangChain project, which is amazing project and
very, very active, very busy, a little bit complicated, you
can build those applications, and there is an Amazon Bedrock
LLM library that you can use with LangChain as well. So let's for a moment imagine
that that's the kind of thing that we're doing here. So we've got an app, our
apps, our applications, which are running code,
which are interacting with a large language model and they're also interacting
with data sources, so very typical kind of
thing that you might do with LangChain. If we see a typical
flow, so we're thinking with our security hats on
now about where connections are being made and where
data specifically is flowing, where is the data? So someone asks our app a
question that relates to data that sits in the data
sources that we have. And so our app will take
that question, if this is how it's been architected, and this is common. The app will take that query,
form a prompt around it, it'll do some prompt engineering with a template like we've seen, and it'll send it to the
large language model. Basically, I have some natural language, I'm a Python application, I
don't know what to do with that, large language model, please help me out, tell me what's happening. And the large language model, again, depending on how it's been
prompted, can respond back. And it might respond
back, say, for example, with a SQL query, I really wanna make sure we also understand this,
like RAG doesn't have to be a vector database, it's
just what everybody likes to talk about. You can do RAG with SQL
database, with a CSV file, a text file, whatever. So it could return back
this is how to get the data that will answer that question, this is how to work
with your SQL database. And then the application will
say, excellent, thank you, I know what to do with a SQL query and run that over the data source, get some information back,
and it's a new number or it's a table of data or whatever it is. So it could send that data back
to the large language model so that the large language
model can then process that in context and say, okay,
well, based on the question that was asked, you've now run that query, thank you for that, now
we've got this data, and send a natural language
response back to the application so that the application can
do what it needs to do next. And maybe this is
responding to a user inside of a chat session, whatever
you might be wanting to do. So that's a pretty typical flow, especially if you're using
something like LangChain.
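(Written out by hand, without LangChain, that round trip looks roughly like the sketch below. Every name and prompt string in it is hypothetical, it just shows the shape of the two calls being described.)

def answer_with_sql(question, table_schema, run_sql, ask_llm):
    # Round trip one: give the LLM the question plus the schema (and typically
    # a few sample rows) and ask it to write the SQL for us.
    sql = ask_llm(
        "You are working with a database with this schema:\n"
        f"{table_schema}\n"
        f"Write a SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    # The application, not the LLM, runs the query against the data source.
    rows = run_sql(sql)
    # Round trip two: send the results back so the LLM can phrase the answer.
    return ask_llm(
        f"Question: {question}\n"
        f"Query results: {rows}\n"
        "Answer the question in natural language using only these results."
    )

The thing to notice is how much of the schema and the query results end up inside those prompt strings, which is exactly the data-flow point being made here.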
A lot of this will happen behind the scenes if you're using something like
Amazon Bedrock with agents. What's really important to consider here is what's happening here. So all of this data which
is flowing to and fro, I sort of talked about it at a high level just then, but I want you to consider
what data is included from what data sources
inside of these transactions which are happening with
the large language model. And in actual fact, especially,
I mean, if you're using, and again, I mentioned LangChain and I do, I like LangChain, I
think it's really cool. It's a little bit difficult sometimes to see what's happening in the backend. And so if you do peer into the logs as you're working with
this, what you can see, and I'm specifically talking about like an experimental SQL chain
that I've been working with, it will say, okay, when it
sets up that initial prompt, the very, very first one
that sends a query over to the large language model, it will say, I'm a large language model, sorry, no, you are a large language
model, you don't know anything about the answer to this question, but you can get the
answer from this database that I have access to. And it does say this in natural language, it's freaky how we program these days. But it also says, I have
access to this database. This is what this database looks like, here is the table, show
table construction, depending on what kind of table it is, here is the first three
or four or 10 rows of data from that database, all of it, so that the large language
model has the context and can understand how
to perform the query so it can create a
syntactically correct SQL query. It's the only way it can do it. So let's all just be clear
about what information is traveling from the application to the large language model,
sensitive data potentially, customer records, not
necessarily that filtered. And so that's fine as long
as we know that's happening, as long as we know we're
keeping that secure. So the security perimeter that we have around the large language
model is important and where it is and how it's operated and whether it's in a trusted zone within our architectural
makeup is important. And the idea here is that we
really wanna wrap all of this, not just the application
and the data sources, like we've always done
forever in security audit and compliance, but it's
also the large language model as well, that is absolutely part of this. It's not just providing
interesting chat responses, it actually is seeing, it and the system that it runs on sometimes see that sensitive data, which is fine as long as we
know that and we're putting the appropriate security
controls around it. So in response to that, I mean, you need to look at the security of the architectures
you're putting together, if you're hosting your own model, then you can look at
which server that's on. If you are connecting with
services, you need to know where those services are and make sure you've got the necessary
compliance in place that meets your requirements. Inside of Amazon Bedrock,
there are services, there are capabilities
to help us with this. So part of this is on the logging side, audit logging of the models. So you can turn on logging
for all of the invocations and responses that you're gonna get back from these foundation models. So we can turn on S3 logging,
CloudWatch logging, or both.
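(With the SDK, turning that on looks something like this sketch. The API call and parameter names are from memory and the resource names are hypothetical, so please check the Bedrock documentation rather than taking this as gospel; you can also switch it on in the console.)

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")  # control-plane client

# Assumed API and parameter names -- verify against the docs before relying on this.
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/bedrock/invocation-logs",                      # hypothetical
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",  # hypothetical
        },
        "s3Config": {
            "bucketName": "my-bedrock-invocation-logs",                      # hypothetical
            "keyPrefix": "bedrock/",
        },
        "textDataDeliveryEnabled": True,  # include the prompt and completion text
    }
)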
And so we can end up with all of that data saved, which is maybe something
that's really useful for a compliance thing,
I've gotta tell you, from a debug perspective,
it's amazing, it's great. If you are working with LangChain and you are a little bit
frustrated by not being able to see exactly what's going
on, use it with Amazon Bedrock, hook up this, you'll see
everything which is going on, all of the prompts and the templates and the combinations of
things get put into there. So that's super useful,
and this combines in with your existing security posture and your existing logging solutions. It either goes into
S3, you can take it out and put it somewhere else,
goes into CloudWatch, again, you can take it out
and put it somewhere else or keep it there or do
whatever you want with it. So that's the sort of logging and audit. From the perspective of the privacy then, and more to the point if you're working with compliant workloads. So if you are working with
stuff that is sensitive to you and you are highly sensitive
about where it goes, or if you're working in a regulated industry
where you can't send data over the public internet,
even if you wanted to, even if you had all of the
SSL layers and everything, you were pretty comfortable
about it, maybe you still can't. And so if that's the case,
then we have this option for you as well. So let's just be clear about
it, when you are connecting to AWS services like S3 and
DynamoDB and all those things, the default position is
that you're connecting to a public endpoint, right? It's still authenticated
and it's still secure and you've still got
encryption and it's fine. But there are these architectures that you may be familiar with,
when you have an application inside of a private VPC,
which doesn't have any access to the internet, you are able
to use these gateway endpoints so that you can get
through to these services. So with S3, the easy one,
it's a free one as well, you can have an S3
endpoint inside of your VPC so that your traffic goes directly to it. And you can do exactly the
same thing with Amazon Bedrock as well, or a very similar thing with Amazon Bedrock as well,
it's called AWS PrivateLink, it's available for other services
as well as Amazon Bedrock. So you can have your
application sat inside of your private VPC, you can then have no internet gateway there,
so it's got no access to the internet because it's
your regulated workspace. And you can have a PrivateLink endpoint that allows you to go directly to the large language model inside of Amazon Bedrock. And then that connection is not going over the public internet and it
opens up the possibilities for you to be able to
host regulated workloads.
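(Creating that interface endpoint looks roughly like the sketch below. The service name, the Region, and all of the IDs are assumptions for illustration; check the documentation for the exact Bedrock endpoint service name in your Region.)

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                          # hypothetical VPC ID
    ServiceName="com.amazonaws.us-east-1.bedrock-runtime",  # assumed service name
    SubnetIds=["subnet-0123456789abcdef0"],                 # hypothetical subnet
    SecurityGroupIds=["sg-0123456789abcdef0"],              # hypothetical security group
    PrivateDnsEnabled=True,  # resolve the normal Bedrock endpoint to the private ENI
)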
And, of course, IAM, identity and access management, is there for you to be able to define your security perimeters. So we're basically
having this architecture with the enormous power of an enormous large language model there, but you are completely in control of the security picture. I spoke for a long time. Thank you. - No, that's fine. So a last poll for you, what do you want to explore next in the
realm of generative AI? - Yeah, and then we're
gonna talk about predictions about what's coming up. Super interested to see
what's gonna happen here. - Yeah. - Because you can select more than one, and so maybe you wanna select
all of them or none of them. We can see. We can see what the results
of that coming up are. And we're gonna have not
really any time for questions, I'm very conscious of your
time, and some people will need to get to the next session,
but I'm more than happy, Tiffany, and I'm sure I can speak on your- - Yes, you can speak on my behalf. - We will both be here, so
you can speak for yourself. We'll both be here afterwards
and we can take questions and talk to you in a moment. How are we doing with this poll results? Oh, it's very balanced. - It's, yeah, it's surprisingly balanced. Building agents, using RAGs. - Building with agents. I think, I'm not surprised to
see that building with agents is sort of like that
leading one, if I'm honest. I think that's probably one
of the most exciting things that's going to be happening. - Getting capabilities
to it next year, yeah. - Over the next,
particularly the next year. - Yeah. - Maybe we can move to predictions. But if I press this
button, nothing happens. Can you press a button? Thank you. All right. Where to focus? - Where to focus? - I think my wizarding skills, maybe? No. Let's go. - So yeah, as we said,
and you answered that also in the poll, so customizing
solutions with agents and RAGs, obviously. But also, it was your concern
at the beginning, security. And this is actually a big topic, so focus on governance and security. - I think so, yeah, I think
the, we've experimented a lot with large language
models and generative AI, especially this year. And I think that we
should continue to do so because I think we're still
unlocking the capabilities of it and we're still figuring
out how it works for us. But it's now kind of time to
switch to production mode, and I think a lot of what
we're talking about here fits into that space. - Yeah. Models will get more sophisticated and add modalities for sure, I mean, we've seen like the exponential curve of LMs being more capable
of answering questions, and for sure that we're
gonna see breakthrough in the next, following years, even months. Generative AI will unlock projects that were not previously feasible. - Yes, I mean- - We were talking about this yesterday. - We were, and I think, I
mean this is something as well that I took a little
bit of a lead from this, I'm not gonna pretend
like this is necessarily my own thought. So Andrew Ng, who is
a very famous luminary in the machine learning world, there's a quite interesting
talk actually on YouTube where he talks about what's
currently happening in the space of AI and generative AI. And I think one of the messages from that is that the capabilities of these models, we talked about it before,
about how they're capable of doing so many things, not just, there is not one thing
that they're trained to do. So we are now in a
position where we can start to look around our enterprise,
social enterprise projects, whatever we're working on,
and finding those datasets that actually have value in them, but it wasn't economical to
get that value out before. Now we have these pre-trained models, someone's done a lot
of heavy lifting for us to make these models. And I think now we're
being able to do things which weren't feasible before,
economically feasible before- - Economically and just like technically, it's much more easy now- - Absolutely.
- To do those things. - All right. - That's a wrap-up. Thank you very much for coming. - [Mike] Thank you so much.
- [Tiffany] Yeah.