Chat with Your Documents Using GPT & LangChain

Video Statistics and Information

Captions
Hello everyone, and thank you for joining today's code-along. My name is Rhys and I'll be your moderator today. We're going to kick off today's session in a couple of minutes; we're just waiting so everyone has a chance to join. In the meanwhile, we'd love to hear from you, so let us know where you're joining from using the chat or the comments, depending on what platform you're watching on, and tell us something that you'd like to get out of this session. This session is going to be hosted on DataCamp Workspace, so if you don't have a DataCamp account already, make sure you get one set up; you won't need a paid account to code along with us today. Aside from that, make sure you register for the event today: that means we can send you the recording as well as any other resources we share. You can scan the QR code, head over to datacamp.com/webinars, or use the link I'll be putting in the chat shortly. I think that's everything from me. I'll be back to repeat these messages for any new joiners, but until then, enjoy the background music and make sure you're registered and have your DataCamp Workspace account warm and ready.

Hello everyone, and thank you for joining today's code-along. My name is Rhys and I'll be your moderator. We're going to kick off today's session in about twenty seconds or so; we're just waiting for the last few people to join. Today's code-along will be hosted on DataCamp Workspace, and I'm going to be sharing a link so you can code along with us shortly, so if you don't have a DataCamp account already, make sure you sign up for one now. If you can't stay for the entire session, make sure you register and we will send you the recording as well: you can scan the QR code on screen, head over to datacamp.com/webinars, or use the link I've sent in the chat. Brilliant, I think that's everything from me, so now I'll hand you over to your host for today's session, Richie. Richie, please take it away.
Hi there, data scamps and data champs, this is Richie. Now, one of the most annoying things I find about working life is when you ask one of your colleagues a question and they say, "Oh yeah, I wrote a document about it, go and read it." So instead of getting a simple answer to your question, you've got to skim through 50 pages of nonsense just to find what you wanted. Generative AI provides a solution to this problem: you can create a chatbot to answer questions about your document, to save you having to read the whole thing. This has the potential to dramatically improve your productivity, so it's an important idea to understand, and today you're going to build a chatbot yourself. One thing to note is that writing this code from scratch takes more than the 45 minutes we have available, so today's session has some bits where you'll be reading existing code rather than typing absolutely everything. We've got two instructors for you today. Andrea Valenzuela is a computing engineer at CERN, so she spends her days writing software for the Compact Muon Solenoid experiment; she's, I guess, hunting for dark matter there. And Josep Ferrer is a data scientist and project manager at the Catalan Tourist Board, and he also teaches on the Big Data master's program at the University of Navarra. I really love that two data scientists with wildly differing day jobs have ended up working together through a love of chatbots. So, with that, over to you, Andrea.

Thank you very much, Richie, for the introduction. Actually, following up on your comment, we share the same background: we do different daily tasks in our jobs, but we did the same bachelor's and the same master's. That's how we got to know each other, and that's how we formed ForCodeSake. As you've introduced, today we have a very nice use case. At ForCodeSake we have been receiving a lot of emails from mid-sized companies saying, "We have a lot of PDFs and we want to train ChatGPT on them," and we have been guiding them on how to do it. Today Josep will showcase the entire workflow. Of course, for a real production service you would need to scale it up and use LLM operations and so on, but today we will see the basic building blocks of the entire chain. This is actually our second webinar; the first one was on optimizing GPT prompts for data science, just in case you're interested in checking it out. About this concrete case study, I think we have said everything: we will use a set of PDFs, and our primary goal is to load this data and make the GPT model use it. To do so, we have a secondary goal, which is building a retrieval-augmented generation pipeline, which is the formal name for that. Why? Simply because it's super useful and super in demand nowadays. With this I hand over to Josep, and I hope you enjoy the session.

Hi everyone, thank you so much, Andrea, for this little introduction. I would like to start with Richie's comment, because it's super true: sometimes when we want to understand a concept or learn new things, we need to read long articles, and this is quite tedious, because people are lazy. Today we have the power of large language models, and this brings a lot of new abilities to the table. The main problem is that it is usually quite complicated to tell these large language models what's inside our PDFs.
So we are here today to talk about how to chat with your documents: basically, how to give a large language model like GPT the ability to understand what's within a PDF and then ask it questions about the PDF or article in natural language.

We have three main objectives today. First, we need to learn how to effectively load and store documents using LangChain. What does this mean? I have a set of PDFs and I want my large language model to use them, so the first thing I need to do is load these documents and communicate the information inside them to the model. Second, we need to build a retrieval-augmented generation pipeline to query these documents directly: once my documents are loaded and stored in a database, I need a way to query them and ask questions about them. And finally, I need to build a question-answering bot that answers questions about these documents. We understand a bot as a model that replies in natural language to the question you've asked, so once I'm able to load and store the documents and find the relevant information in them for a given question, I need to generate a natural-language response using that information.

So why are we using LangChain? For those who still don't know it, LangChain is a collection of libraries in Python designed to help us create complex applications powered by large language models. People who use these models often know they have a lot of power, but they have some problems; for instance, their output is not always reliable. (Sorry to interrupt, we cannot see your slides, we can only see the notebook.) No, sorry, I'm going to move over here, OK. The point is that we need LangChain in order to be able to communicate our data to this large language model.

First of all, let's define the application we're going to build today. It's in the notebook, but to see it a little bit bigger we'll explain it on the slides. Imagine we have a set of PDFs. The idea would be, "OK, I want to train a large language model with these PDFs", but training is super costly, so we're not actually going to train anything: we're going to extend the model's knowledge instead, as I'll explain later. The main idea is that I have these PDFs and I need to load them into my environment. But I cannot pass the full documents to my large language model, so I need to split the articles into smaller pieces of text; this is what we call chunking. Once we have smaller pieces of text that are easy to process, we need to capture the semantic information each piece carries, so we're going to embed it, in this case using OpenAI's embeddings (although we could use any other embedding model), and get an embedding for each chunk. Finally, once the data is represented as vectors, we need to persist it somewhere, and that is the vector store. Vector stores play a key role because they let us store vectorized text data while maintaining all the semantic information.
That is the first process: loading and processing our documents. We start from PDFs, we split them up, we embed them, and we persist them in a database that is optimized for vectors, called a vector store. The second process is talking with our documents. We are the user here, and we need to be able to ask a question, a query, about these documents, so we embed this question as well to vectorize it. Once we have this question embedding, we perform a semantic search: we check which pieces of information in the store are relevant to us, and (this is why I said this is extended knowledge rather than training) we get a list of ranked results, the chunks of text that are most aligned, or semantically most similar, to the question. We pass these to the large language model together with the original question, and we take advantage of the natural-language ability of GPT, or any other large language model, to generate a contextualized answer.

We've decided to split today's webinar into three main parts. The first one, which is also the longest, is understanding the basics of LangChain, because to create such an application we need a lot of components. Then there are two final parts: loading and processing our documents, which combines the components used in the first process, and then talking with our documents.

Let's start. First of all we need to install LangChain, of course, and also tiktoken, wikipedia, pypdf, faiss-cpu and pinecone-client; you can just execute the cell here. Super important as well: we need the OpenAI API key, because we're going to use OpenAI for both the GPT-3.5 model and the embeddings, and we'll have an example using Pinecone, so make sure you have the Pinecone API key as well. You can find a guided tutorial for getting both API keys linked in the notebook, OpenAI in one link and Pinecone in the other, but I'm going to explain briefly how to get the Pinecone one, because it's a service not many people know. First, just search for Pinecone on Google, go to the Pinecone website, and sign up for free. Once you have a free account you need to create a project; with a free account you're only allowed one project, and it will be a starter project on Google Cloud Platform. There you will find the API keys: the Pinecone API environment name is "gcp-starter", and the key itself is the value that appears when you click the button.

Let's continue. Here we have all the imports; no worries, we will import everything. (One more thing, can we just close the file browser, so the main notebook is a bit bigger? If you click on Files on the left... OK.) Now we have all the imports we're going to need. First of all we define our API keys, all of them; you can use the os library or just define them directly in the notebook.
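As a rough sketch of that setup step (the exact package list follows the talk, but the variable names and the way the keys are stored here are my assumptions, not a copy of the workspace code):

    # Install the packages mentioned above (run once):
    # pip install langchain openai tiktoken wikipedia pypdf faiss-cpu pinecone-client

    import os

    # Define the API keys so the OpenAI and Pinecone clients can pick them up
    os.environ["OPENAI_API_KEY"] = "sk-..."   # your OpenAI key
    PINECONE_API_KEY = "..."                  # your Pinecone key
    PINECONE_ENV = "gcp-starter"              # free-tier environment name mentioned above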
Perfect, we have everything imported. So first of all we're going to start with LangChain. As I've already said, LangChain is basically a framework that helps us create complex applications using large language models, and in this webinar in particular we're using it because it allows us to communicate data effectively to large language models. We have PDFs and we want to send the information in those PDFs to the model; before, this was not straightforward, but thanks to LangChain we have components and elements that allow us to pass this information directly to the large language model. It also supports agentic behaviour, although today we're not going to use that much.

LangChain basically has two main advantages. The first is that it uses components, so it gives a modular approach to software development. This webinar has been built with the OpenAI GPT model, but because we're using LangChain's large language model classes, if in a month we want to repeat this with an open-source model we can just switch the model and the rest will keep working. And then chains are super important as well: they are sequences of actions tied together that allow us to create more complex applications on top of large language models. It's also very fast, it's always up to date, and even though the documentation is not the best, it has a super strong community, so we can always find discussion of any bug or error, and of every new LangChain release.

The components we're going to explain and use are the following. To load and process our documents we'll use the document loaders (these PDFs need to be loaded into our environment) and then the text splitter, because once we have the PDFs we need to split them into chunks of text. In both processes, loading the documents and talking with them, we will need the text embeddings, to convert the chunks of text into embeddings, and the vector store, to persist the data and then, when the user asks a question, to embed that query and perform the semantic search. For the second part we will also use the large language model itself, of course, plus chains, natural-language retrieval, metadata and indexes, and memory; that last one is optional, but we could add it.

The first thing we're going to start with is the large language model. This is the basis of it all, the power behind our service. We import the LLM classes from LangChain and use the OpenAI integration: we define an LLM instance called chatgpt, using the gpt-3.5-turbo model with a temperature of zero, just to avoid the model saying nonsense. If we execute the cell, asking it "please tell me some funny jokes", that's exactly what it does: nothing new for those who have already tried ChatGPT or any other large language model.
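A minimal sketch of that step, assuming the legacy LangChain chat wrapper around the OpenAI API (the instance name chatgpt follows the talk):

    from langchain.chat_models import ChatOpenAI

    # GPT-3.5 with temperature 0, so the model stays as deterministic as possible
    chatgpt = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

    print(chatgpt.predict("Please tell me some funny jokes"))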
LangChain also lets us segment messages into three types, and this is super important. There is the system message, which defines a high-level behaviour: every time I talk with a large language model, I expect it to behave in a specific way; if I ask about code, I expect it to behave like a senior coder. Then it allows us to pass human messages, meaning inputs from the user, and AI messages, meaning the replies of the model. For instance, with the chatgpt instance I've created, I can define a high-level role: "You are an AI bot that helps people decide where to travel; you always recommend three destinations with a short sentence for each." Then I can add additional context with an AI message, the model telling me "Hello, I am a traveller assistant, how can I help you?", and a human message, which would be me writing "Where should I travel next?". And we can extend this with more information.

Then let's move on to text embeddings. Text is quite complicated to process directly, so what we can do is make a vectorized projection of the text that retains and keeps all the semantic information. The embedding lets us convert text data into a vector so we can work with it numerically while still keeping its meaning. Starting now, you will find cells with some parts of the code missing; these are tasks for you. In this case, the first thing we do is import OpenAIEmbeddings from LangChain's embeddings library and create an instance. Then we write some text, "this is a webinar", take our embeddings instance, and call the embed_query method with that text as input. What we're saying is: I have this text, and I will use OpenAI's embeddings to vectorize it and produce a vector. This vector has a dimension of 1,536, which will be important later, especially for Pinecone.

Then we have chains. As I've already said, chains are sequences of actions tied together. This is super important because we can build chains that perform exactly the actions we want. The most basic one is the ConversationChain, which simply lets us interact with a large language model through text, like having a conversation with it. During the webinar, though, we're going to use the load_qa_chain, because it is optimized for question answering and lets us pass input documents to contextualize an answer; we will see it later.
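Before moving on, here is a small sketch of the message types and the embed_query call just described, assuming the legacy langchain.schema message classes; the travel-bot wording follows the example above.

    from langchain.schema import SystemMessage, AIMessage, HumanMessage
    from langchain.embeddings import OpenAIEmbeddings

    messages = [
        SystemMessage(content="You are an AI bot that helps people decide where to travel. "
                              "You always recommend three destinations with a short sentence for each."),
        AIMessage(content="Hello, I am a traveller assistant. How can I help you?"),
        HumanMessage(content="Where should I travel next?"),
    ]
    # chatgpt is the ChatOpenAI instance defined earlier; it returns an AI message
    print(chatgpt(messages).content)

    # Embedding a short piece of text: the result is a plain list of floats
    embeddings = OpenAIEmbeddings()
    vector = embeddings.embed_query("This is a webinar")
    print(len(vector))   # 1536 dimensions for OpenAI's embedding model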
Then there is memory. We're not going to implement memory in this webinar, because it would be more advanced, but the bare minimum we expect when we talk with a bot is that it has some kind of memory and understands what has happened in previous interactions. The problem with today's large language models is that whenever we send an input, the model doesn't remember what happened before. To address this, LangChain has a component called memory that keeps track of the interactions we've had with the model. Here we have the ConversationSummaryBufferMemory, which creates a summary of all the interactions we've had with the model itself; there are other options, like retaining only the last two interactions, or keeping a window plus a summary of older interactions, and so on. The only thing to keep in mind is that whenever we're dealing with a large language model, the number of tokens we're inputting is what matters most, because it affects both the speed and the cost. To use it, we would again get the memory classes from LangChain and instantiate the ConversationSummaryBufferMemory. I won't explain much more about it, because we're not actually going to use it, but it is quite important.

Now we're going to start the most important part, which is actually dealing with documents, the main focus of the webinar. Any questions or clarifications so far?

We have two questions from the audience so far, and if anyone else in the audience wants to ask questions, please do so in the chat now. The first question comes from Nanda Kumar: are we going to cover reading specific chapters of a PDF, or even specific page numbers? That's an interesting question: do you ever want to look at just bits of documents, or would you generally look at the whole document?

The thing is, we will load the whole document, but then, in order to be able to process it, we will break it down into smaller pieces, because it's way easier to process and more effective. We will see it now.

Super. The next question comes from Melanie: why is LangChain providing better results than ChatGPT plugins like the data analyst or PDF summarizers, or the customized versions of GPT, when querying text?

Well, I wouldn't say it's performing better; it's just that LangChain gives us this modular approach, which lets us create complex applications that don't necessarily depend on a single model or a single company. We can switch between models, or switch between different kinds of libraries, as we advance in our application development. That, I would say, is the most important advantage of LangChain: a much more open integration.

OK, so LangChain is helping you write better code rather than necessarily getting better results.

Not necessarily better results, but yes, the main reason for choosing LangChain is this modularity.

Super, and no more questions from the audience yet, so I think you can proceed.

Perfect. The main focus now is that we want to deal with documents. If you check the Files pane, we have a folder with different PDFs, but these documents don't necessarily need to be PDFs. The first thing we need to learn is that the most basic document class LangChain has is the Document: a class that contains a page content, which is text, and metadata. Here we have a task: first, from langchain.schema, import the Document class, which is the most basic document, and then we're going to create a Document instance.
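A sketch of what that task might look like, using the dummy values (ID, source, timestamp) described next:

    from langchain.schema import Document

    doc = Document(
        page_content="This is a dummy document",     # the text itself
        metadata={
            "document_id": 677,                      # any ID we want
            "source": "my_source.pdf",               # where the text came from
            "create_time": "2024-01-01T00:00:00",    # an arbitrary timestamp
        },
    )
    print(doc.page_content)
    print(doc.metadata)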
So we create a Document with a page content, "this is a dummy document", and with metadata. The metadata usually has a document ID, so we can set whatever ID we want, say 677, and a document source: if this document were coming from a PDF, we would have the document source here, for example "my_source.pdf", plus a document creation time, for instance a random timestamp. Now we have a Document object that presents a page content, "this is a dummy document", which is the text we have, and metadata with an ID, a document source and a creation time.

The thing is, this is the most basic class and we just made it up. We don't want to make documents up; we want these documents to come from somewhere, and this is where document loaders take a key role. Document loaders are elements that allow us to take data from either an online or an offline environment and get it into our code. The online loaders are basically integrations that LangChain has with important websites, or with online PDFs, which let us pull information from the internet; the offline loaders usually get files from our own system, in our case PDFs.

Let's exemplify this. Wikipedia has an integration with LangChain, the WikipediaLoader, so we can go to LangChain's document loaders library and import this WikipediaLoader. Then we create an instance of the loader and give it the title of the Wikipedia article we want, in this case the "Machine learning" article. Now that we have the instance generated, we just need to get the data: wikipedia_data = loader.load().
The load command returns a list of documents; these loaders already split articles or very long texts into several documents. If I print the first one, you see a Document element with the page content, a long text: "Machine learning (ML) is a field of study in artificial intelligence...". If we check the Wikipedia website itself, it's the same text. Below that we have the metadata: the title, "Machine learning", a summary, and a source, in this case the URL it came from. And I can access both: I can take the first document of wikipedia_data and get its page_content, or take the first document and get its metadata, with the title and so on. So what this WikipediaLoader is doing is taking the information we have on the website and loading it into our environment.

Now we're going to repeat this procedure, but with the Hacker News loader and the PyPDF loader. I'm going to leave you a couple of minutes and then we'll do it together. Remember the structure is the same as before: in the first case, get the HNLoader from the document loaders and load the data from the website you have here; in the case of the PDF, we're using PyPDF, so you need to initialize the PyPDFLoader instance with the path and then use the load command to get the PDF data.

Let's go for it. The first thing is importing the Hacker News loader, which we've already done, and then initializing the loader with the website, this URL here; once we've done that, we just call load on it. The same happens with the PDF, but here we need to write the path instead of a URL: with the PyPDFLoader we use our own path, so first we write "docs/" and then the attention PDF document, and the PDF data is pdf_data = loader.load().
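Putting the three loaders together, roughly (the Hacker News URL and the PDF file name below are illustrative placeholders, not values confirmed in the session):

    from langchain.document_loaders import WikipediaLoader, HNLoader, PyPDFLoader

    # Online loader: pulls the "Machine learning" article straight from Wikipedia
    wikipedia_data = WikipediaLoader(query="Machine learning").load()
    print(wikipedia_data[0].page_content[:100])
    print(wikipedia_data[0].metadata)

    # Online loader: comments from a Hacker News item (placeholder URL)
    hn_data = HNLoader("https://news.ycombinator.com/item?id=34422627").load()

    # Offline loader: a single PDF from the local docs/ folder
    pdf_data = PyPDFLoader("docs/attention.pdf").load()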
Just so you know, later we will load all the PDFs at once using the PyPDFDirectoryLoader, getting the whole folder instead of a single PDF, but we're not going to do that yet.

Perfect, so now we have our data. With the PDFs loaded into our system, we need to split them. In the splitting there are two key components: the data chunks and the model tokenizer. Large language models don't understand words and actually don't work with words; they work with tokens, and the tokenizer is the element that encodes text into tokens, the natural unit that large language models can understand. The data chunks are what we get when we break a text into smaller pieces.

To do this, we first import the RecursiveCharacterTextSplitter from LangChain's text splitter library, and the GPT2TokenizerFast tokenizer from the Transformers library. First we initialize our tokenizer from the pretrained "gpt2" model, and then we define a count_tokens function: we input a text and it returns the length of the encoded tokens. Why? Because the text splitter lets us personalize what length measure we want to split our documents by. Then we create the RecursiveCharacterTextSplitter with a chunk size of 200 and a chunk overlap of 20, and define count_tokens as the length function; we could actually skip the length function, because by default it uses len. If we execute this on the Wikipedia data, we get 78 chunks from the original article, each containing approximately 200 tokens.

Now you need to do the same for the PDF data and for the Hacker News data. For the PDF data you can use the count_tokens function again, but for the Hacker News data you can skip it and leave the default. So: get the splitter and the tokenizer from the libraries, initialize the tokenizer (the counting function is already written), initialize the text splitter, and then break the PDF data and the Hacker News data with it.

OK, so the first thing is importing both the RecursiveCharacterTextSplitter from LangChain's text_splitter library and the GPT2TokenizerFast from the Transformers library, then initializing the tokenizer from_pretrained("gpt2"). We already have the count_tokens function predefined, so we initialize our text splitter, remembering it's an instance of this class, with a chunk size of 200, a chunk overlap of 20, and count_tokens as the length function.
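A compact sketch of the splitting step as described, counting length in GPT-2 tokens and splitting into roughly 200-token chunks with an overlap of 20:

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    def count_tokens(text: str) -> int:
        # Length measured in tokens rather than characters
        return len(tokenizer.encode(text))

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=200,        # target chunk length, in tokens here
        chunk_overlap=20,      # overlap so meaning is not lost at the boundaries
        length_function=count_tokens,
    )

    pdf_chunks = text_splitter.split_documents(pdf_data)
    print(len(pdf_chunks))     # 64 chunks for the attention PDF in the session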
It is important to understand that whenever we break documents or text into pieces of a fixed length, 200 in this case, it might happen that we cut a word or even a sentence in half and lose semantic information. That's why we always define a chunk overlap: no matter whether we break a word or a sentence, the information is still kept in the following chunk. Then the PDF chunks are just text_splitter.split_documents applied to the PDF data. For the Hacker News data we can skip the imports, because we've already done them, initialize the text splitter again without setting a length function (so it uses the default), and split the Hacker News data the same way. Now we have 64 chunks for the PDF data and 232 for the Hacker News data.

Just as a little validation, we can plot a histogram of the chunk lengths, in this case for the PDF chunks, and what we see is that most of our chunks have a length between 175 and 200, so we are getting what we expect. And just so you know, besides splitting into chunks of our chosen size, we can also split by pages: whenever we import a PDF we could break it directly by pages. This can be used for applications that need the pages to be maintained as a unit, but usually we prefer to set our own personalized chunk length and overlap.

So now we have the data: the PDFs are loaded into our environment and broken down into chunks. The next thing is vector stores. Vector stores are basically databases that are optimized for persisting vectors. There are two types: local vector stores, which live on our machine (we're going to use FAISS), and online, cloud-based vector stores, in this case Pinecone. For FAISS we first import the FAISS object from LangChain's vector stores library, plus the OpenAI embeddings. This is important because we actually don't need to embed our vectors first and then persist them: when we are creating the database, we give it both our data and the embedding protocol to follow, so we are doing both things at once. With Pinecone it's a little bit more complicated because it's a cloud-based service: we first need to define our API key and our environment name (remember you have a guide to obtain them, and you can get them directly on the Pinecone website), and then, importantly, we need an index. If you remember, we created a project; in that project you can create an index with the name you want and set its dimensions, which in this case need to be 1,536, because that is the dimension of our embeddings. As a free account we can only have one index. You can also create the index programmatically with pinecone.create_index, passing the index name, the dimension, and the metric to use for the semantic search.
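Roughly, creating the two kinds of vector store described above (the Pinecone calls assume the older pinecone-client v2 API, and the index name is just an example):

    import pinecone
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import FAISS, Pinecone

    embeddings = OpenAIEmbeddings()

    # Local vector store: data and embedding protocol are handed over together
    db = FAISS.from_documents(pdf_chunks, embeddings)

    # Cloud vector store: initialise the client, create a 1536-dimension index once,
    # then push the same chunks into it
    pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)
    # pinecone.create_index("webinar-index", dimension=1536, metric="cosine")
    pinecone_db = Pinecone.from_documents(pdf_chunks, embeddings, index_name="webinar-index")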
In our case, as we've already created the index, we already have it. Now we move to natural-language retrieval. Now that we have the vector database, we need to get information out of the PDFs using semantic search. With the FAISS database built, we can ask a question, for instance "Can you please tell me all the authors of the article Attention Is All You Need?", which is the article we've uploaded, and run the similarity_search command. What this does is embed our question, check the alignment of our question against all the chunks contained in our vector store, and output the ones that are most aligned: the matches, all the text that contains information that might be relevant to the question we're asking. The k parameter limits the output to the top k results, so when I ask this with k set to two, I get two chunks of text from the article containing the information I'm asking about. But I don't want raw chunks; I want a natural-language answer.

You can replicate this with Pinecone, using the Pinecone vector store; I'm going to skip that part because we are a little bit rushed. The important part is that once we have this similarity search, we can define a chain, a load_qa_chain, and use it to generate a natural-language answer. So instead of receiving raw chunks, I now receive "The authors of the article Attention Is All You Need are..." followed by the names. This is why I told you we are not actually training a large language model: what we're doing is passing the matches we found, together with the question our user is asking, through a chain, in order to obtain a natural-language response. Here this is done with Pinecone, and you can replicate it using the FAISS vector database; the structure is the same, just using the FAISS db instead of the Pinecone one. So basically we import the chain, load_qa_chain, again, create our FAISS database, define our query ("Can you please tell me all the authors of the article Attention Is All You Need?"), find the matches we need, execute the chain, and we obtain the natural-language answer.

The last element, which we've already talked about, is the metadata. Whenever we find these matches, for instance if I ask "Who created Transformers?" and run a similarity search on the database, what I get back are the matches, and I can take, say, the fourth match and look at its page content, its metadata, and, within the metadata, the source and the page. What we see is that this match was obtained from the attention PDF, which we already knew, and from page zero, the first page, because the authors are usually on the first page.
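The retrieval-plus-answer step sketched in code, following the commands named above; the query string matches the one used in the session:

    from langchain.chains.question_answering import load_qa_chain

    query = "Can you please tell me all the authors of the article Attention Is All You Need?"

    # Semantic search: embed the query and return the k most similar chunks
    matches = db.similarity_search(query, k=2)

    # "stuff" chain: place the matched chunks into the prompt and let GPT answer
    chain = load_qa_chain(chatgpt, chain_type="stuff")
    answer = chain.run(input_documents=matches, question=query)
    print(answer)

    # Each match keeps its metadata, so we can see where it came from
    print(matches[0].metadata)   # e.g. source PDF and page number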
We can use this metadata to trace back where the information is coming from. Now that we have all of this, part two is just putting these elements together to load and process the data. We have three main steps. First, the loader: if you remember, the loader is the element that loads the PDFs into the environment, and in this case we're going to use the PyPDFDirectoryLoader. I'll give you a couple of seconds and then we'll continue. So: we import the PyPDFDirectoryLoader from LangChain's document loaders library and initialize it, but remember that in this case we are uploading the whole folder, the directory, not any specific PDF, so we write "docs/", and the data is defined with loader.load(). This takes a little while, because it's five PDFs.

The second step, once we have this data, is chunking: we break these documents down into little pieces. So we get the splitter and the tokenizer, create the function that counts tokens with our tokenizer, define the splitter, and break the PDFs into chunks. Again, a little time to do it on your own, and then we'll do it together. OK: we get the RecursiveCharacterTextSplitter from LangChain's text splitter library and the GPT2TokenizerFast as the tokenizer, initialize the tokenizer from the pretrained "gpt2" (we already have the count_tokens function defined), initialize the text splitter again with a chunk size of 200, a chunk overlap of 20 and count_tokens as the length function, and then call split_documents on our data. (Oh, we have an error; I've written something wrong... there, now it works.) Now we have 610 chunks, corresponding to all the PDFs we've uploaded to our system.

To finish the loading and processing part, we just need to persist all this data in our database. Again we call FAISS and the from_documents command, and we pass the chunks and the embeddings: we are sending the data plus the embedding protocol in order to generate our vector store. And we can validate that it's working by running a similarity search right away, asking again "Who created Transformers?". Remember we're using FAISS, the local vector store; we find four similar chunks, so it's working, and that's perfect.
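The whole loading-and-processing pipeline from this part, condensed into one sketch (paths and chunk counts as in the walkthrough):

    from langchain.document_loaders import PyPDFDirectoryLoader
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores import FAISS

    # 1. Load every PDF in the docs/ folder
    data = PyPDFDirectoryLoader("docs/").load()

    # 2. Break the documents into ~200-token chunks with a 20-token overlap
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=200, chunk_overlap=20, length_function=count_tokens
    )
    chunks = text_splitter.split_documents(data)   # 610 chunks in the session

    # 3. Persist chunks plus the embedding protocol in the local FAISS store
    db = FAISS.from_documents(chunks, OpenAIEmbeddings())

    # Quick validation: the search should return relevant chunks
    print(len(db.similarity_search("Who created Transformers?")))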
So at this point our data is stored in the vector store: we have loaded the PDFs, processed them, and made them reachable for our large language model. Now only the second part remains: talking with our documents. Again we need to import the load_qa_chain, the chain that lets us send our large language model the matches from the similarity search together with the user's original query, then perform the similarity search and generate a natural-language response using this chain. I'll give you a little bit of time and then we'll do it together.

OK. The first thing is that we import load_qa_chain from LangChain's question-answering chains library. Then we define our query, again "Can you please tell me all the authors of the article Attention Is All You Need?", and define the chain: the chain will be load_qa_chain with chatgpt, our large language model, and the chain type "stuff". Then we perform a similarity search using our FAISS vector store, and finally the response is just executing the chain with the matches we found as input documents and the user's original query as the question. With this, our large language model answers us in natural language: "The authors of the article Attention Is All You Need are..." and all the authors.

The only problem we still have is that we don't know where this answer is coming from. An easy way to trace back where the information comes from is to generate an enriched prompt. Again we load the chain, define the query, define the chain as a load_qa_chain and perform the similarity search, but now, from these matches, we get both the input text and the input metadata: for each element of the list of matches we take the metadata and build a new list, and we create a prompt, the metadata enrichment, that says "The provided information has been extracted from {input_metadata}. Please state the sources, both PDF and page, in the response." Now, instead of sending the model just the user's query, we send the query together with this metadata prompt, and the model tells us where the information is coming from.

From here we could keep playing and ask many other questions, of course, even questions that are completely out of scope; the large language model should be able to tell us that this information is not in the documents. One thing is important in this case: as we are performing a mathematical operation to get the most similar vectors, the similarity search will still return some matches for an out-of-scope question, so we would still be passing the model something as the "source" for that answer. And we can define a function in order to ask our model and make this more automatic.
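One way to write that metadata-enriched query, approximating the prompt wording used in the session (the way the metadata list is formatted here is my own choice):

    query = "Can you please tell me all the authors of the article Attention Is All You Need?"
    chain = load_qa_chain(chatgpt, chain_type="stuff")

    matches = db.similarity_search(query)
    input_metadata = [match.metadata for match in matches]   # PDF name and page for each match

    metadata_enrichment = (
        f"The provided information has been extracted from {input_metadata}. "
        "Please state the sources (both PDF and page) in the response."
    )

    # Send the user's question together with the metadata instruction
    response = chain.run(input_documents=matches, question=query + "\n" + metadata_enrichment)
    print(response)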
So this is all for the webinar. If you remember, we started by looking at this application with two main processes: loading and processing our documents, a set of PDFs that we load, break down and embed into a vector store, and then talking with our documents. We've gone through the three parts of the webinar and we finally have both processes working: persistent data in a vector store with our PDFs, and a large language model, GPT-3.5, that is able to generate natural-language answers to the questions the user is asking, contextualized with that vector store. Of course, this is the simplest bot you can create, but it's the basics, and from here you can add layers of complexity: you can add memory, you can add an interface, and you can keep growing, but the pipeline is the one we've explained here.

That's all from our side. Andrea and I work together as a consultancy duo, and we're also currently writing a Medium publication called ForCodeSake, where we share data science and programming insights, so go check it out. In the coming weeks we'll be posting a couple of articles related to this webinar, large language models and LangChain, and you can follow us on the newsletter as well. And now it's Q&A time: any question you have, or any question Andrea hasn't been able to answer in the chat, we're here to hear them.

Wonderful, that was a really great session, and I love that, even though there was a bit of code written beforehand, you can get to building that chatbot and actually getting answers from your documents pretty quickly, within the time. All right, we actually have loads and loads of questions from the audience. It is coming up to time, so before we get to audience questions I just want to say that we have a load of webinars coming up; we are fully into the swing of things again. Tomorrow we've got a session on how to break into AI, so if you're interested in a career in this area, we've got a really great speaker coming along. On Thursday it's me and my colleague Adel talking about data and AI trends and predictions for 2024, so if you want to keep a handle on what's going on, please do show up on Thursday. And next week, on Tuesday, we've got a session on data storytelling with ChatGPT, so if you're interested in turning the results of your analyses into some kind of coherent story, please do show up again on Tuesday. We are going to run slightly over time with these questions, so if you've got to jump, please do catch up on the questions in the recording, and remember to register in order to get the recording sent to you.

On to the audience questions. There's one from Shunan: how does the retriever part work at a high level? For example, what are the strategies for finding the relevant results? I think that was about when you got the top results for a question: how does it find what the best matches are?

Basically, we have this vectorized representation. The embedding generates a vector in an n-dimensional space that retains the semantic information of the original sentence. The new query also gets a vector representation, and what we compute is the distance between these vectors in that n-dimensional space: the lower the distance, the more semantically related they are. Of course we have a lot of chunks, so we find the optimal chunks, the ones most aligned with our question. That's how it works.

That's a great answer. Andrea, did you want to add anything to that?

No, I think it's clear. This question has appeared many times in the chat, and I've been trying to explain basically this: it's based on distance; it's linear algebra magic once it's in matrix form.

Hopefully linear! Super. OK, the next question comes from KD: what's the best way to do chunking if your PDF has a particular structure? So do you want to talk about chunking strategies in general?
How do you go about breaking the document into smaller chunks, and what's your advice there?

Usually, the smaller the chunks are, the better the similarity search works, so we tend to use a small length. Of course it depends on the necessities and requirements of the project itself, but usually we tend to use smaller chunks.

OK, and was it 100 characters or something you used? Like, how big a chunk should you use, and is it better to split one sentence at a time or a paragraph at a time?

We usually define it in tokens, and it's around 200 tokens.

200 tokens, so a couple of sentences. OK.

And on that, Josep also had a very nice plot. If you're chunking your data and you can visualize it like he was doing, it's nice to make sure you're using an optimal chunking, because otherwise, if the document has a different structure and it cannot fit into the chunk size you want, you will see a wide distribution in that plot. So try to find a sharp peak in the number of tokens you're supposed to have in each chunk.

OK, so maybe a little bit of experimentation just to see what gives the best results. All right, excellent. The next question comes from Gopi: here we used the OpenAI API, we're using GPT, so what happens if you want to use the Hugging Face API instead?

Exactly as we did with the LLM library of LangChain: instead of importing the OpenAI class, we can just import the Hugging Face Hub integration, define the name of the model we want from the Hugging Face interface, and use that model. That's kind of the magic of LangChain: we can just switch one model for another, so instead of the GPT large language model we can use Llama or any other open-source model from Hugging Face.

Yeah, as you were saying, LangChain is very modular, so it's easy to switch things out. Nice.
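As a hedged illustration of that swap (the repo_id below is just an example model, and a Hugging Face Hub API token would need to be set in the environment):

    from langchain.llms import HuggingFaceHub

    # Same role as the ChatOpenAI instance above, but backed by an open model on the Hub
    open_llm = HuggingFaceHub(
        repo_id="google/flan-t5-xxl",                     # example model, swap for any other
        model_kwargs={"temperature": 0.1, "max_length": 256},
    )
    chain = load_qa_chain(open_llm, chain_type="stuff")   # the rest of the pipeline is unchanged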
All right, this one comes from T: what's the difference between training and embedding? So we're not going to train ChatGPT on the PDF documents; what is embedding exactly? I think you covered this, but maybe once more just to make it clear.

The thing is, we are not training the model itself. We have some data contained in PDFs or any other source, we load it into our environment, and we embed this data into a vectorized projection in order to be able to do the semantic search. Whenever we have a prompt, we embed that prompt, check which pieces of text from our PDFs are most semantically similar, and then use the large language model to generate a contextualized answer. So we are not training the model; we're just using its ability to produce natural-language answers, together with this extended knowledge, to answer our user.

All right, nice. Oh, we've got loads more questions; we're going to run through for maybe five more minutes, so let's try to get through these quickly. Using the PyPDFLoader, how would content like tables and images inside the PDF be processed? I think we just worked with text today, but what happens if you have other content?

In the case of PyPDF we can only process the text itself; for images and other kinds of content we need other libraries.

OK, and how do you work with other text files? What if you want .txt or Word documents in there?

The LangChain loaders library has many options, and there's the documentation link, so you can find many different types of loaders that are specific to different types of files, whatever you want to get into your large language model. PyPDF is only for PDF documents.

OK, so just find the right loader and hopefully you're all sorted. Excellent. Another question from Nanda Kumar: if you chunk text and your content spans across different chunks, can the model use multiple contexts? I think this is: what happens if the answer lies across multiple chunks?

Yes, I've also written about this in the chat. I think there are two approaches. The first is to try to prevent this from happening: that's what Josep was showing with the chunk overlap, so that we replicate the information at the extremes of the chunks in order not to separate it. But in case it does happen, that's why the k parameter is useful, to get more matches from the data: if there's a paragraph talking about the same thing and we split it in two, and you set k to two, you will most likely get matches from both halves, and the model will be able to process both chunks. That's why it's also useful to set a higher k.

OK, so play around with the chunking options and hopefully that reduces the problem. All right, next question, from Ed: can we load or prompt a questionnaire? So, say we've got 20 standard questions we want to ask our PDFs; rather than asking each question one at a time, can you ask multiple questions, and how would you go about doing that?

In the end you can just define a function and iterate over all the questions you have, repeating the process for each of the questions.

OK, so a for loop seems to be your friend in this case. Easy peasy.

What I wouldn't do is put all the questions into the same call to ChatGPT, because otherwise you will get a lot of matches for multiple questions and it will be difficult to assign them, and you also need to bear in mind the token limit.

Yeah, so ask questions one at a time rather than asking all 20 questions in a single prompt.
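A small sketch of that idea: looping over a list of questions and repeating the retrieve-then-answer step for each one (the questions here are just examples).

    questions = [
        "Who created Transformers?",
        "What is multi-head attention?",
        # ... the rest of the questionnaire
    ]

    for q in questions:
        matches = db.similarity_search(q, k=4)        # retrieve matches per question
        print(q, "->", chain.run(input_documents=matches, question=q))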
All right. The next question, from The Evil Men: is the chunking idea connected to avoiding sending huge prompts to the API? So do you just want to go through why you need to do chunking?

I'm not sure I understand the question completely, but the thing is that processing huge amounts of text is not efficient, so it is better to have smaller chunks or smaller pieces of text: both for the semantic search, because smaller sentences have a more focused meaning and the search is more accurate, and because the whole process is more efficient.

Yeah, I think it's slower if you're sending vast amounts of text backwards and forwards to the API, and it's not going to find the right bits of the document for you. All right, next question, from an anonymous LinkedIn user: what's the process for preventing OpenAI from using the data from our PDFs for retraining their models in the future?

Well, with the methodology we've shown, the embedding does have access to the text, that's true, although I don't think they store it anywhere. But if you use a cloud-based solution like Pinecone for the vector store, you should be aware of how they handle your data. In case you don't want your data to be stored anywhere, you can always use open-source alternatives, both for the large language model and for the embedding function (on Hugging Face there are so many), together with a local vector store like FAISS, so you can avoid any concerns. And if it's sensitive data, it is better not to put it into any cloud-based service.

Actually, reading this question again, I think there are two different concerns here. The first is whether your data is going to a cloud provider who then has hold of it, and the second is whether OpenAI can use the content for retraining their models in the future, which they do with your content on ChatGPT unless you opt out, but for the API they're not going to reuse your content for training future versions of GPT; that's in their terms and conditions, I believe. It's something we've worried about a lot at DataCamp, so you should be OK using the API.

All right, we've got three more questions and then we'll finish. This one's from Johnny: what other search types are available, and why would you use one or the other? We used similarity search here; what are your other options?

Usually for the vector stores we use similarity search. I'm not sure you can change the command itself, but you can define the metric, how the computation is mathematically done; I'm not sure whether there's any other way of performing the search.

Yeah, there are a couple of different metrics, dot product and variations, and it's not something you need to worry about too much unless you're getting terrible results, and then maybe play around. I think in the AI code-alongs that Rhys posted a link to, there's one on how semantic search works, and that shows you all the different options, so please take one of our code-alongs. All right, next question: how can we summarize across chunks of data, and how does the model summarize each chunk at a time? So I guess the question is: if you're trying to summarize a longer document and you've got chunks, do you summarize one chunk at a time, or do you summarize the whole document and then chunk it? What's the workflow there?

Do you mean if the user asks to summarize a specific article? I'm not sure I understand the question, sorry.

OK, I think it was just: if you're trying to do document summarization and Q&A on it, is it possible to do both?

Then we shouldn't do the similarity search, because what you actually want is a summarization of the whole article. The best option, I think, would be to get all the chunks from that specific source, the article we want to summarize, and pass all those chunks to the large language model to get a summary. If it's too long, maybe do some batching or something like that, so as not to overflow the model or the API. But this wouldn't be the same flow as getting natural-language answers to questions.

All right, super. At this point we are well over time, so I think we have to call it a day.
Sorry that I don't get to absolutely every question. I hope everyone enjoyed the session; I think it's such a cool use case, and I hope everyone had fun building their chatbot. There are so many good webinars coming up, and I hope to see you again in a future session. Before that: thank you once again to Josep, thank you to Andrea, thank you to Rhys for moderating, thank you to everyone who asked a question, and thank you to everyone who showed up today. See you in future webinars.

Thank you so much.
Info
Channel: DataCamp
Views: 982
Id: 3XLstUOUgdE
Length: 76min 46sec (4606 seconds)
Published: Wed Jan 10 2024