Getting to Know Llama 2: Everything You Need to Start Building

Captions
Hello and welcome! I'm super excited to be here. As you all know, large language models have taken the world by storm. However, we have still seen very little or limited usage of LLMs in large generative AI applications, for a few reasons. Number one, the most effective large language models are closed, which makes them less customizable and means you lack ownership. Second, it is expensive to train and run LLMs, making it hard to build a viable business model. And third, it is hard to access, deploy, and learn the effective techniques to make these models work for your business.

Llama 2 was launched in July of this year with an open, permissive license, free for both research and commercial use. It solves the first two problems, and today I will attempt to solve the third by showing you how easy it is to access Llama and use it in your application development. My name is Amit Sangani, and I'm the director of the partner engineering team working on open source projects such as Llama and PyTorch. My team's primary mission is to make it easier for developers to integrate these platforms into their projects and solve real-world problems.

In this session we'll go through a lot of content. We'll start with the basic concepts and build on them, going through the actual code and running it. By the end of this session you will have enough understanding of Llama 2 to take the starter code I show you today and use it in your own Llama 2-based gen AI application. A quick disclaimer: there's a lot of content and I'll be going fast, so please try to follow along as much as you can. All the code I show you is open source and will be available immediately after the session, so you don't have to write anything down.

Let's start with a few prerequisites. We expect the audience to have a basic understanding of large language models and of Python. If you haven't done Python coding, that's fine; you will still be able to follow along with the code.

Now, what is Llama? Llama 2 is a family of state-of-the-art large language models from Meta, and they come in three sizes: 7 billion, 13 billion, and 70 billion parameters. The models come in two flavors: pre-trained and chat. The pre-trained models were trained only on publicly available datasets; no data from Meta's applications or users was used in training them. The chat models are fine-tuned versions of the pre-trained models, optimized for dialogue use cases.

Choosing a model requires weighing size, quality, cost, and speed. Larger models are more accurate and capable, but they are more expensive to run and have higher latency, whereas smaller models are faster and cheaper but may be less accurate. You have to consider these trade-offs when selecting a model for your generative AI application; we typically recommend starting with a smaller model and gradually growing into a larger one.

Along with the models, we also launched two documents. One is the research paper, which gives you the technical details of what the models are, how they were built, and all the benchmarking and evaluation numbers, which you can read in great detail. The other is the Responsible Use Guide, which tells you how to make sure your models are safe for your users.

Let's talk about accessing these models. There are multiple ways. One of the simplest is to register on Meta's website, download the models, and deploy them in your own infrastructure. This gives you full control: there is no restriction on how you use them in your generative AI applications, and you can fine-tune them with your own dataset. The second way is to use a hosted API platform like Replicate; in this session we will use Replicate because it provides a very simple API on top of the models. The third way is to use a hosted cloud platform like Azure, AWS, or GCP. We have partnerships with all the major cloud providers, and with a few clicks you can provision a VM and use the models in your generative AI applications.
Now let's talk about the use cases of Llama. There are many; these are just a few. One is content generation, the most common use case: you can use Llama to generate content for your poems, articles, emails, whatever you want. The second is chatbots, which are becoming really popular; this is the familiar conversational AI assistant that users can converse with, and in today's session we will show how to build one. Another is summarization: you can summarize an article, or even an entire book. And programming is a very popular use case as well, where people use large language models like Llama to generate, analyze, and debug code; we launched Code Llama just recently to do exactly this.

So now let's get into how to use Llama. What you see here are a few of the dependencies you first need to install: replicate, langchain, and a handful of others. The reason we need replicate is that it provides a simple Python client for making calls to the Replicate server where Llama is hosted. LangChain lets you build generative AI applications easily; it hides a lot of the nitty-gritty details inside its libraries so we don't have to do that work ourselves, and I'll show you today how to use it. Then we have sentence-transformers, which provides the embedding models we need when we want to bring an external data source into our framework; again, I will show you how to do that. The other dependencies, like pypdf and pdf2image and so forth, are mainly for loading a PDF file, which we will use as an external data source in our generative AI application. And what you see here is the URI of Llama 2 13B on the Replicate server, which is the model we will use.

Now let's go through the most important setup: the Replicate server. The first thing you need is a Replicate API token, which I have already got and will add here in a second; once I have it, I will be able to run all the code. After entering the token, I define two functions: a completion function and a chat completion function. These are nothing but wrapper APIs on top of replicate.run, which makes the call to the Replicate server and passes in the input prompt. The chat completion function takes a prompt and a system prompt, and the system prompt is prepended to the actual prompt to control the behavior of the large language model, Llama in this case.
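Here is a minimal sketch of what that setup and those two wrappers might look like, assuming the current replicate Python client; the model identifier, default parameters, and the manual Llama 2 chat template are illustrative assumptions, not verbatim from the session:

```python
import os
import replicate

# The replicate client reads the API token from the environment
os.environ["REPLICATE_API_TOKEN"] = "<your-replicate-token>"  # hypothetical placeholder

llama2_13b = "meta/llama-2-13b-chat"  # assumed model identifier; some client versions need a version hash

def completion(prompt: str, temperature: float = 0.6, top_p: float = 0.9) -> str:
    # replicate.run streams tokens back as an iterator; join them into one string
    output = replicate.run(
        llama2_13b,
        input={"prompt": prompt, "temperature": temperature, "top_p": top_p},
    )
    return "".join(output)

def chat_completion(prompt: str, system_prompt: str = "") -> str:
    # Prepend the system prompt using Llama 2's chat markers; note that some
    # hosted chat endpoints accept a separate system_prompt input instead
    full_prompt = f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{prompt} [/INST]"
    return completion(full_prompt)
```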
So let's run the first prompt: "The typical color of a llama is". Because this is the first time we are running it, my servers are not warmed up, so it may take a little bit of time, but the whole goal here is to make a very simple call to the server and see how sentence completion works. Now we have received our response, and you can see that it is pretty elaborate; it's a little verbose, but it gives a pretty good answer: the typical color of a llama is a lovely shade of brown, along with a bunch of other details, like "more reddish in color" and so forth.

Next, let's use the chat completion function. As you can see, "The typical color of a llama is" is the prompt, and the system prompt says "respond with only one word"; we are directing the model to behave in a certain way. Let's run it; since the server is warmed up now, we should see the output pretty quickly, and here it is: "brown". It responded in one word.

Our models can do much more than that: they can return output in a particular format you ask for. In this example, we ask for the response in JSON format, and it does come back as JSON. This is really important, because you can imagine sending a request to Llama, getting a response back in a specific format, and plugging that response into your workflow for other tasks. Llama can do all of that in conjunction with your entire workflow system.
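Using the chat_completion helper sketched above, the one-word and JSON-format behaviors might look like this; the prompt wording is an approximation of the session's:

```python
# One-word answer: the system prompt constrains the response
print(chat_completion(
    "The typical color of a llama is:",
    system_prompt="Respond with only one word.",
))  # e.g. "Brown"

# Structured output: ask for JSON that downstream code can parse
import json

raw = chat_completion(
    "Give the typical color of a llama and its average lifespan.",
    system_prompt="Respond only with a valid JSON object; no extra text.",
)
data = json.loads(raw)  # plug the parsed fields into the rest of your workflow
print(data)
```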
Now let's talk about gen AI application architecture. Here's a pretty simple picture: users interact with applications, which can be mobile or web. If your application needs a large language model, it accesses one through a hosted API that connects to your platform. The platform can be your own custom server hosting Llama, or it can be Hugging Face, Replicate, or any other provider. The user request goes through the hosted API to the platform, the platform sends it to Llama as user input, Llama processes and synthesizes it, and the output flows back through the platform to the end user. The framework layer, LangChain in our case, is super critical as well, because LangChain hides a lot of the nitty-gritty details of building generative AI applications behind a simple interface. We will use LangChain as we go through the code today, and I personally like it a lot because it makes building applications super easy.

Now let's look at the chatbot architecture. There are five important pieces to building a chatbot: user prompts, input safety, the Llama 2 model, output safety, and memory and context. Let me explain in more detail. The user sends a prompt, and the prompt goes through an input safety layer; the reason we have this layer is to make sure no harmful content reaches Llama, and you can add your own additional input safety checks on top of it. The content then goes, together with the context, into the Llama model, which produces output; that output goes through an output safety layer, and only then is it sent back to the user. So you have multiple layers, input safety and output safety, both of which can be added on your side; Llama internally also goes through its own safety checks. To build a chatbot we also need memory: you have to store the previous context so that Llama understands what came before and can follow the conversation you are trying to have with it.

Now, LLMs are inherently stateless; they are like HTTP. You send a request, you get a response back, and the model retains no previous context. Here's an example where we send the prompt "What is the average lifespan of a llama?", and when we execute it, it returns that the average lifespan of a llama is around 20 to 30 years. This single request and response is called a single turn.

Let's take another example: we ask, "What animal family are they?" This is the interesting part, because the model doesn't know what "they" refers to, and when we execute this, it returns an animal family described as dragons. Clearly this is not right; the model has hallucinated, because it could not remember the previous question about the lifespan of llamas and so could not infer that we were talking about llamas. Llama needs the previous context if you want to have an intelligent conversation.

In the next example, we store the previous context: we include the earlier exchange, "User: What is the average lifespan of a llama?" together with the assistant's response, and then ask "What animal family are they?". Because we have stored the previous turn and sent it as part of the prompt, we now get a much better answer: llamas are members of the camelid family, which includes camels, alpacas, and vicuñas. By storing the previous context in memory and adding it to the prompt, we are able to tell Llama that we are talking about llamas, and we get the right response back from the model.

This is what a basic chat conversation looks like. It is obviously simplistic: in a real chat application you will store context subject to a limit on the context size (4,096 tokens for Llama 2), so you might keep the last 10, 15, or 20 exchanges and send them as part of your prompt, so that you can have an intelligent conversation with the model.
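A minimal sketch of that memory mechanism, assuming the completion helper from earlier: since the model is stateless, each prior exchange is folded back into the prompt (here using Llama 2's [INST] chat markers) so the pronoun "they" can be resolved:

```python
# One prior turn: (user message, assistant response)
history = [
    ("What is the average lifespan of a llama?",
     "The average lifespan of a llama is around 20 to 30 years."),
]

def chat_with_history(question: str, history: list) -> str:
    # Replay every stored turn in Llama 2's multi-turn chat format,
    # then append the new question as the final [INST] block
    prompt = ""
    for user_msg, assistant_msg in history:
        prompt += f"<s>[INST] {user_msg} [/INST] {assistant_msg} </s>"
    prompt += f"<s>[INST] {question} [/INST]"
    return completion(prompt)

# With the earlier turn in context, "they" now resolves to llamas
print(chat_with_history("What animal family are they?", history))
```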
Now let's go to prompt engineering. As you can see from the previous examples, we curate our prompts before sending them to the server, and there are ways to keep curating them, giving additional examples that help us get the desired responses from Llama. Here are a couple of examples of in-context learning: zero-shot and few-shot.

In zero-shot learning we don't provide any examples; the model has been trained on a certain class of examples, and we hope it can infer a different class. In this code we send a prompt asking Llama to classify the statement "I saw a gecko" and tell us its sentiment. Our expectation is one of three labels: positive, negative, or neutral. But when you execute this, because Llama was given no examples, it comes back with "cute". Now, cute is an emotion, not a sentiment, so the model has not inferred what kind of answer we wanted. Let's execute this in real time, and you can see the response coming back as "cute".

So what do we need to do to get a sentiment grouped into positive, negative, and neutral? In the next example we give a few additional demonstrations: we classify "I love llamas" as positive and "I don't like snakes" as negative, and then give it "I saw a gecko". Now Llama is able to infer, accurately, that "I saw a gecko" is a neutral sentiment. This is a very simple example, but it's extremely powerful: think about all the reviews you might have, 5,000 or even 50,000 of them. You can feed them to Llama and it will identify which ones are positive, which are negative, and which are neutral. By curating your input prompts, you get the most desired output from Llama.

Let's take another example that starts zero-shot. If you just ask "vicuña?" as a question, it's hard even for a human to know what answer is expected, and Llama does the same thing: it just echoes "vicuña" back. But what if you give some examples so it can spot a pattern: "llama?" answered "yes", "alpaca?" answered "yes", "rabbit?" answered "no", and then ask it to infer "vicuña?". When you execute this, Llama is able to say "yes", which means it understands that the vicuña is in the same family as llamas and alpacas. This is extremely powerful, because the model infers something you never stated in the prompt; all you have done is give examples, and it infers the rule from them.

Now let's look at chain of thought. What you see here is a word problem, and we are asking Llama to solve it. Inherently, because of the way large language models work, just predicting the next token, it is very hard for them to solve word problems. We send in this problem: "Llama started with five tennis balls. It buys two more cans of tennis balls. Each can has three tennis balls. How many tennis balls does Llama have?" If you take a few seconds, you know the answer should be 11: five balls plus two cans of three each. But what does Llama return? Eight tennis balls. Clearly it was not able to work through this logically, step by step. This is where chain-of-thought prompting comes in: you add one simple statement, "Let's think step by step", and Llama will then solve the problem logically. Let's execute this and see the results: Llama figures out, step by step, that it starts with five tennis balls, buys two more cans with three balls each, so two times three is six, and six plus five is eleven. Again, a very simple example, but you can imagine supplying a very complicated algorithm as input to Llama, and it can work through it and even generate a similar algorithm, solving a pretty complex task. That makes it super powerful.
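Here is a sketch of those three prompts (zero-shot, few-shot, and chain of thought), again assuming the completion helper from earlier; the exact wording is reconstructed, not verbatim:

```python
# Zero-shot: with no examples, the model may answer with an emotion like "cute"
print(completion("Classify the sentiment: I saw a gecko. Sentiment:"))

# Few-shot: labeled examples pin the output to positive/negative/neutral
few_shot = (
    "Classify: I love llamas! Sentiment: positive\n"
    "Classify: I don't like snakes. Sentiment: negative\n"
    "Classify: I saw a gecko. Sentiment:"
)
print(completion(few_shot))  # expected: neutral

# Chain of thought: one added sentence elicits step-by-step reasoning
word_problem = (
    "Llama started with 5 tennis balls. It buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does Llama have? "
    "Let's think step by step."
)
print(completion(word_problem))  # expected: reasons through 5 + 2*3 = 11
```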
Prompt engineering is great, but it has limitations. There are two major ones. First, large language models are trained only up to a certain date, and beyond that date you cannot get useful inference: if your content is recent news, the model has no knowledge of it. Second, they lack specialized knowledge: if you have custom documents or data sources that were not publicly available at training time, you will not be able to query against them.

There is a technique called retrieval-augmented generation, or RAG, and here's the very simple architecture for it. Say you have an external data source, and you want to query it for relevant information and then pass that to your LLM so it can produce a more detailed answer. The user prompts the framework, LangChain in our case, and LangChain connects to the external data source, which can be a relational database, a PDF file, or a set of documents. You convert those documents into embeddings and store them in a vector store, which I will show you how to do in a few seconds. Once you retrieve the relevant information from the external data source, you put it into your prompt and then send it to Llama. This is an excellent approach when you have domain-specific data you want to query against.

So let's look at the code. We first initialize LangChain, which, as I said, is an open source library, and point it at our Replicate server where the model is hosted. Then there is a four-step process. Step one is to load the external data into our documents; in this scenario we'll use a PDF file, the Responsible Use Guide, and query against it. It's a simple PDF, and you can see the URL right there. Step two is to tokenize and split the document into chunks. The reason is that when a query comes in, we want to find the relevant chunk rather than search the entire document; by splitting it up, we can retrieve just the relevant area within the document. Step three is to use an embedding model: Llama, like any LLM, doesn't understand raw text, it understands numbers, so we convert the text chunks into vectors the LLM can work with. And step four is to use FAISS, the Facebook AI Similarity Search library, to index the embeddings and chunks in our vector store.
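A sketch of that four-step pipeline using the classic LangChain APIs; module paths and wrapper arguments shift between LangChain versions, and the PDF URL and Replicate model string here are assumptions:

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import Replicate
from langchain.chains import ConversationalRetrievalChain

# Step 1: load the external document (assumed URL for the Responsible Use Guide)
docs = PyPDFLoader("https://ai.meta.com/static-resource/responsible-use-guide/").load()

# Step 2: split the document into chunks so retrieval returns relevant passages
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20).split_documents(docs)

# Step 3: embed each chunk with a sentence-transformers model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Step 4: index the embedded chunks in a FAISS vector store
vectorstore = FAISS.from_documents(chunks, embeddings)

# Wire retrieval into a conversational chain backed by Llama 2 on Replicate
llm = Replicate(model="meta/llama-2-13b-chat", model_kwargs={"temperature": 0.6})
chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever())

chat_history = []
question = "How is Meta approaching open science?"
result = chain({"question": question, "chat_history": chat_history})
print(result["answer"])

# A follow-on question works because the previous turn is passed as history
chat_history.append((question, result["answer"]))
print(chain({"question": "How is it benefiting the world?", "chat_history": chat_history})["answer"])
```

The chain condenses the chat history and the new question into a standalone query before retrieval, which is what lets the follow-on question resolve "it" correctly.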
Now let's run some prompts against it. I have a prompt right here, "How is Meta approaching open science?", and the answer comes from the PDF file. The output looks pretty good: Meta is approaching open science by open-sourcing code, and you see a few more sentences here. This text is not copied verbatim from the PDF; rather, the relevant content is retrieved from the PDF, pushed to the LLM, and the LLM generates the answer.

Then I ask a follow-on question: "How is it benefiting the world?" This is again a multi-turn chat, like the one we saw earlier: we are storing the chat history, so the model knows the previous context and builds on it. We should see a response about the different ways open science benefits the world, and here it is: democratization of access, increased innovation, and a bunch of other things. So this is a very simple way to query your own set of documents and use them in conjunction with your Llama models. That concludes how you would build a chatbot using LangChain and prompt engineering.

Let's go to the next step: how do you fine-tune your models? There are certain limitations to prompt engineering. What if your domain-specific data is massive? You can't fit it into your prompts, and you may not be able to index all of it in a vector store. In that case you can use another technique, fine-tuning, where you change the weights of the model itself. Here is a simple architecture: you take a custom dataset, train against a pre-trained Llama model, and you get a fine-tuned Llama model. There are many different types of fine-tuning: parameter-efficient fine-tuning (PEFT), LoRA, and QLoRA, which is a quantized version of LoRA. The whole goal is to produce a newly fine-tuned model that has intelligence about your own dataset.

We also use something called RLHF, reinforcement learning from human feedback. Once you have a model fine-tuned on your dataset, you continue to refine it with human annotations to reduce losses and make it more accurate. You also want to run evals and quality benchmarks against these fine-tuned models to make sure accuracy stays high. You can do all of this using PyTorch, our open source AI framework, which supports pre-training and fine-tuning and has a bunch of libraries you can use for the fine-tuning work.
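To make the fine-tuning options concrete, here is a heavily condensed LoRA sketch using the Hugging Face transformers and peft libraries; the model id (gated on Hugging Face), hyperparameters, target modules, and toy dataset are illustrative assumptions, not the training recipe from the talk:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # assumed model id; requires accepted license on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small low-rank adapter matrices instead of all 7B weights
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the base weights

# Toy stand-in for your domain-specific dataset
texts = ["Domain document one...", "Domain document two..."]
ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-lora", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=ds,
    # mlm=False pads batches and sets labels for causal language modeling
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama2-lora-adapter")  # saves only the small adapter weights
```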
Let's go to the next section: responsible AI. With power comes responsibility. These large language models are extremely powerful, and when you use them in your applications, you want to make sure the output they generate is safe for your users. You want to minimize hallucination; chances are you will not completely eliminate it, but you want to minimize it as much as possible, and you want to take care of the input and output safety layers. Llama itself is pretty safe, but if your application demands additional safety, you should be able to add it.

Red teaming is super important as well. Red teaming means simulating real-world attacks, and when Llama was launched we ran exactly that exercise: more than 300 people, both internal employees and external vendors, worked with Llama to make sure it is extremely safe. We had cybersecurity experts, misinformation experts, people from legal, policy, and so forth, who pounded on Llama to make sure the models generate output that is safe for people to use, and we continued to refine it until we hit certain benchmark numbers. The Responsible Use Guide is linked here, and I have listed the resources as well.

I want to conclude by saying that there is active research in this space; every week we see new innovation and new large language models coming up. Llama 2 is extremely powerful. Make safety and responsibility the central pillar when you think about building generative AI applications. I also want to make a call to action: this notebook will be available for you to use on our GitHub recipes. Please use Llama in your projects; this is great starter code that you'll be able to drop into your own projects. And give us feedback; we really value it as we think about building the next generation and the next version of our model. With that, I'll conclude my talk. My information is here, please reach out to me, along with the information of my colleague who helped me build this notebook; please reach out to him as well. Thank you.
Info
Channel: Meta Developers
Views: 6,485
Id: Z5MFSlDrOdA
Length: 33min 33sec (2013 seconds)
Published: Wed Sep 27 2023