Discover LlamaIndex: Joint Text to SQL and Semantic Search

Video Statistics and Information

Captions
Hi, my name is Jerry, and welcome to the third episode of Discover LlamaIndex, where today we'll be covering a very interesting and important topic: joint text-to-SQL and semantic search. This is a recently released feature in LlamaIndex, and it showcases how you can combine the power of SQL over your structured data with semantic search over your unstructured data, all in one package that you can use when building your LLM application. So let's get started.

First is the text-to-SQL side of things. LlamaIndex has support for text-to-SQL, and this is a pretty popular stack for structured data: the user asks a question in natural language, and it goes through a text-to-SQL prompt to the language model. You can get more complicated than this; you could build an entire system around text-to-SQL, for instance inferring the table schemas, or storing the table schemas themselves inside a vector store. But fundamentally, imagine a prompt that you send to the language model that says: here is a set of schemas for my data collection, here are some example rows; given this, can you generate a SQL statement for the incoming query? It turns out models like GPT-3 and GPT-4 are pretty good at text-to-SQL, so you can essentially compile natural language into SQL as a declarative language, execute the SQL statement against the SQL database, and get back the final answer. The nice thing about SQL is that it's an extremely expressive language for tabular data: if your data is organized in some structured format, it can perform aggregations, joins, windowing, and a lot more. The example queries that work well for a text-to-SQL use case are really queries for structured analytics.
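The flow just described can be sketched end to end: build a text-to-SQL style prompt, then execute whatever SQL the model returns against the database. This is a minimal illustration only; the prompt template, the toy `city_stats` table, and the pre-canned `generated_sql` string (standing in for a real LLM call) are all hypothetical, not LlamaIndex's actual internals.

```python
import sqlite3

# Hypothetical prompt template for illustration; the actual prompt
# LlamaIndex sends to the model may differ.
TEXT_TO_SQL_PROMPT = (
    "Given the table schema and example rows below, write a SQL query "
    "that answers the question.\n"
    "Schema: {schema}\nExample rows: {rows}\nQuestion: {question}\nSQL:"
)

def build_prompt(schema: str, rows: str, question: str) -> str:
    """Fill in the text-to-SQL prompt for one incoming question."""
    return TEXT_TO_SQL_PROMPT.format(schema=schema, rows=rows, question=question)

# Stand-in for what a model like GPT-4 might return for the question
# "What is the city with the highest population?"
generated_sql = (
    "SELECT city_name, population FROM city_stats "
    "ORDER BY population DESC LIMIT 1"
)

# Execute the generated statement against a toy SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_stats (city_name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?)",
    [("Toronto", 2930000), ("Tokyo", 13960000), ("Berlin", 3645000)],
)
top_city = conn.execute(generated_sql).fetchone()
print(top_city)  # ('Tokyo', 13960000)
```

The point of the "declarative compile target" framing is exactly this: once the model emits valid SQL, the database, not the LLM, does the aggregation.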
These are the types of questions you want to ask to get statistics or insights from your tabular data. For instance: what is the average population of cities in North America? What are the largest cities and their populations in each respective continent?

Now let's talk a little bit about semantic search over unstructured data. For those of you who are starting to build LLM applications, you've inevitably come across this basic stack: you store your unstructured text in some sort of vector database. You ingest it from a data source like a PDF or an API, chunk it up, and throw it into a vector database, whether that's in-memory, Pinecone, Chroma, Weaviate, etc. Once you actually have the data in a vector store, you set up a retrieval-augmented generation system that combines a retrieval model with a synthesis model. When an incoming natural language question arrives, you first perform a retrieval step: embed the question, use that embedding to look up the top-k most similar documents in your vector database, and retrieve those documents. Then, given the retrieved documents, you feed them into a synthesis model, which uses the language model to generate a response given the context that's injected into the prompt. This is the basic stack that's emerging for asking questions over your unstructured data.

If we take a step back and think about the benefits of asking questions over unstructured data: there are certain types of questions that require making sense of unstructured text, where it's not obvious how you would convert that text into a tabular format to begin with. For instance, you might want to ask a question describing a certain scene or a certain event in history, or ask about a particular event.
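The retrieval step described above (embed the question, take the top-k most similar documents) can be sketched with plain cosine similarity. The tiny hand-written 3-d vectors below are stand-ins for real embeddings from an embedding model; everything here is illustrative, not a real vector-database API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_top_k(query_emb, docs, k=2):
    """docs: list of (text, embedding). Return the k most similar texts."""
    ranked = sorted(docs, key=lambda d: cosine(query_emb, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy 3-d "embeddings" standing in for a real embedding model's output.
docs = [
    ("Berlin has many historical museums.", [0.9, 0.1, 0.0]),
    ("Tokyo hosts large summer festivals.", [0.1, 0.9, 0.0]),
    ("Toronto sits on Lake Ontario.", [0.0, 0.1, 0.9]),
]
query_emb = [0.8, 0.2, 0.1]  # pretend: embedding of "museums in Berlin"
print(retrieve_top_k(query_emb, docs, k=1))
```

In the real stack the retrieved texts would then be packed into the synthesis prompt for the LLM.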
Those kinds of questions are well suited for semantic search. Some example queries: tell me about the historical museums in Berlin; don't just give me the number of them, actually describe in plain English exactly what museums are there, as well as the history behind them. Another: in The Great Gatsby, what does Jordan ask of Nick on behalf of Gatsby? You actually have to go in and retrieve the actual dialogue from that conversation in order to get back an accurate answer.

So one question we've been asking ourselves, and which motivated the release of this feature, is: what if you could combine the two somehow? Combine the power and expressivity of SQL over your structured data with the unstructured context you get from storing unstructured text in a vector database. That's why we propose this new feature, called joint text-to-SQL and semantic search. Let's take an example use case: tell me about the arts and culture of the city with the highest population. This is the question you want to ask over your data, and ideally you want to build an LLM application that can satisfy this query. Let's walk through the steps a human might take to fulfill a query like this. First, you want to find out: what exactly is the city with the highest population? This is well suited to a structured analytics query, like text-to-SQL, because it really is just about getting a statistic from a table, assuming you have that table of cities and populations. So you could do something like a SELECT from the table with an ORDER BY on population. Let's say the response here is Tokyo. After you get back the city with the highest population, you
probably want to refine the original question, because now you have more information. So you ask: tell me about the arts and culture of Tokyo. This is a question you want to ask over the vector database using semantic search. The interesting thing is that a lot of these vector stores nowadays offer a more interesting interface than just a natural language query or an embedding lookup: you can also specify metadata filters, basically structured tags attached to the unstructured data, so that you can effectively filter for relevant content. So you ask this question over the vector database, supply the relevant metadata tags, and get back a response: some popular events and festivals in Tokyo include etc., etc. You could look this up from the Wikipedia article. Then you combine all the intermediate steps into a final answer that you surface to the user.

If you think about steps two and three, it's roughly similar to joining two tables together within a structured database, but here you're effectively joining information from your structured data (what's the city with the highest population?) with the unstructured context in your vector database (now that we know it's Tokyo, can we fetch information from this unstructured corpus about the arts and culture scene?). Here is a rough diagram of how this entire flow can be packaged into a feature, and this is in fact the feature we released in LlamaIndex. It's a pretty robust query interface that can effectively join your SQL database with your vector database, as well as query each one independently if it needs to. I'll talk a little bit about what
this means. As a basic example, going with what we just showed in the slides, let's say the query is: tell me about the arts and culture of the city with the highest population. You would first hit some sort of selector: given the question, you first want to decide whether the first step should hit a SQL database or go directly to semantic search over a vector database. Because we need to know what the city with the highest population is, we're going to route this query to the SQL database. The SQL query engine that we have in LlamaIndex can then run text-to-SQL over this query, translating it into the structured analytics query (what is the city with the highest population?) that can run as a SQL statement against the database. You get back the response: Tokyo is the city with the highest population. Then you have a query transformation step that takes in the information so far from the SQL database plus the original question, and returns a more detailed question: what are some popular festivals and events in Tokyo that celebrate its culture and art? Notice that at this query transformation step, you could actually just produce the final answer if you wanted to. If the query really was something you could get directly from text-to-SQL, without needing to augment it with any unstructured context, then you could return the final answer without going through the vector database. So if the original question is simply "what is the city with the highest population?", then by the time you get to this step you don't need to hit the vector database at all; you already got the results from running text-to-SQL over the SQL database. But back to the original question: you ask, what
are some popular festivals and events in Tokyo that celebrate its culture and art? You would hit the vector database query engine, which in turn hits the auto-retriever module, also a recently released feature in LlamaIndex, that takes in this query and infers not only the query string but also the metadata filters you need to supply to the vector database. Finally, once you get back a response from the vector database, you join that response with all the intermediate steps to get a final answer.

So without further ado, let's walk through an example notebook to showcase how this works. In this example (I'm just going to zoom in really quick) we'll show you how to use what we call the SQL Auto Vector query engine, and how you can set it up step by step. First we'll do some basic imports; I'll restart the kernel really quick. Great. Notice that we just import some basic constructs so that we can set this up. The other thing we're going to set up is our Pinecone index. If you have followed our tutorials, this is just the basics of how to use the Pinecone library: get your API key, create an index, and, for the sake of completeness, make sure to clear it. This part is just configuration; we set the chunk size to 1024.
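As a rough sketch of what that chunking configuration governs, here is a simple word-based splitter. It's a crude stand-in for the tokenizer-aware splitter LlamaIndex actually configures via the chunk size; the overlap parameter and the toy document are illustrative.

```python
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 24) -> list[str]:
    """Split text into overlapping word-based chunks.

    A stand-in for a real tokenizer-aware splitter: real chunking counts
    tokens, not words, but the sliding-window idea is the same.
    """
    words = text.split()
    step = chunk_size - overlap  # advance by less than chunk_size to overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

doc = "tokyo " * 2500  # a toy 2,500-word "article"
chunks = chunk_text(doc, chunk_size=1024, overlap=24)
print(len(chunks))  # 3
```

Each chunk is then embedded and inserted into the vector store separately, which is why the chunk size trades off retrieval granularity against context per hit.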
We define the LLM predictor, the service context, all of these things, and we also define a Pinecone vector store; Pinecone is going to be used as our vector database. For our SQL database we're going to use SQLAlchemy, a Python library that hooks up with any sort of structured database and speaks SQL. Here we're creating a SQLite database, and then a sample table. Going along with the example from the slides, we create a table called city_stats that contains city name, population, and country; this is the structured table holding all the population statistics. (Okay, let me just recreate this real quick.) Here we insert three data points, Toronto, Tokyo, and Berlin, with populations of roughly 3 million, 14 million, and 3.6 million; my numbers might be totally off, by the way, but this is just an example. Great. We're also going to load in the Wikipedia articles corresponding to each city. We can do this with the wikipedia Python package, and we also have a Wikipedia reader within LlamaIndex that can load in the unstructured article for each city.

Now that we have the data and the basic setup done, we're going to build the SQL index to interface with the SQL table we just set up, and we're also going to build the vector index over the Wikipedia articles. Notice that we set up a single vector index containing the articles from every city, but each city has metadata tags associated with it; these metadata tags include the title, and we attach the city corresponding to each document. Next up, we're going to define the query engines for both the text-to-SQL engine and the vector store engine.
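A self-contained sketch of the kind of SQLAlchemy setup described here, assuming an in-memory SQLite database; the column names follow the city_stats example from the slides, and the populations are the rough sample numbers from the video, not real figures.

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, create_engine, insert, select,
)

# In-memory SQLite database via SQLAlchemy, as in the notebook.
engine = create_engine("sqlite:///:memory:")
metadata = MetaData()

# The city_stats table from the example: city name, population, country.
city_stats = Table(
    "city_stats",
    metadata,
    Column("city_name", String(32), primary_key=True),
    Column("population", Integer),
    Column("country", String(32)),
)
metadata.create_all(engine)

rows = [
    {"city_name": "Toronto", "population": 2930000, "country": "Canada"},
    {"city_name": "Tokyo", "population": 13960000, "country": "Japan"},
    {"city_name": "Berlin", "population": 3645000, "country": "Germany"},
]
with engine.begin() as conn:
    conn.execute(insert(city_stats), rows)

# The structured-analytics query: which city has the highest population?
with engine.connect() as conn:
    top_city = conn.execute(
        select(city_stats.c.city_name)
        .order_by(city_stats.c.population.desc())
        .limit(1)
    ).scalar_one()
print(top_city)  # Tokyo
```

In the notebook, this `engine` (wrapped in LlamaIndex's SQL database abstraction) is what the text-to-SQL query engine executes its generated statements against.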
Here we're defining the SQL query engine, and here a custom retriever query engine over the vector store, which specifically uses this thing called a vector index auto-retriever. There are basically two ways of interfacing with the vector store in LlamaIndex. One way is pretty simple: you take the question, get the embedding from it, and directly do an embedding lookup against the underlying vector store. The auto-retriever is a step up in sophistication: given some input query, it can infer not only the query string but also any additional metadata filters you might need to supply in order to pull the right information out of the vector store. Certain providers, like Pinecone and Chroma, already expose this as an interface, and the nice thing about the auto-retriever is that you can use the LLM to infer the right tags to hit the vector store with. This is nice because it enables better joins with the SQL database when we subsequently hit the vector store after getting back initial results from the SQL table.

The next thing we're going to do is wrap these two things as tools. You can think about a tool in the context of agents and tools, but here it effectively just lets each engine be used by the downstream query engine. Cool, so after all the setup we get to the good stuff: we define the SQL Auto Vector query engine right here, which takes in the SQL tool, the vector index tool, and some additional configuration. With that set up, we're going to run three queries: one query that makes use of both the SQL database and the vector database, which is exactly the example I showed in the slides, and then some examples where you only need to hit the vector database or only the SQL database. So let's run through the first example.
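Conceptually, what the auto-retriever does can be mocked in a few lines: infer a structured filter from the query, then search only the documents that pass that filter. In this sketch a simple substring check and a word-overlap score stand in for the LLM inference and embedding similarity that LlamaIndex actually uses; the city list and documents are hypothetical.

```python
KNOWN_CITIES = ["Toronto", "Tokyo", "Berlin"]

def auto_retrieve(question: str, docs: list[dict]) -> dict:
    """Mock auto-retrieval: infer a metadata filter from the question
    (a real auto-retriever asks the LLM to do this), then rank only
    the documents that pass the filter."""
    city = next((c for c in KNOWN_CITIES if c in question), None)
    candidates = [
        d for d in docs
        if city is None or d["metadata"]["city"] == city
    ]
    # Stand-in for embedding similarity: rank by word overlap with the question.
    q_words = set(question.lower().split())
    return max(candidates, key=lambda d: len(q_words & set(d["text"].lower().split())))

docs = [
    {"text": "Tokyo hosts festivals that celebrate art and culture.",
     "metadata": {"city": "Tokyo"}},
    {"text": "Berlin is known for its museums and galleries.",
     "metadata": {"city": "Berlin"}},
]
hit = auto_retrieve("What festivals in Tokyo celebrate its culture and art?", docs)
print(hit["metadata"])  # {'city': 'Tokyo'}
```

The value of the inferred filter is that documents about other cities never even enter the similarity ranking, which is what makes the SQL-to-vector "join" reliable.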
The query is: tell me about the arts and culture of the city with the highest population. First we query the SQL database, because we're able to infer that, given this question, we first need to figure out what the city with the highest population is. Notice that the SQL query is SELECT city_name, population FROM city_stats ORDER BY population, and you get back a response: Tokyo is the city with the highest population. (Part of this response is probably hallucinated by the LLM, because it isn't in the SQL database, but nevertheless you have the information you need.) Going along with our diagram, we then get a new question given the SQL response: what are some specific art galleries, museums, and festivals in Tokyo? We query the vector database with that, and get: some specific art galleries and museums in Tokyo include the Nezu Museum, the National Diet Library, the National Archives, and so on. Finally, everything is combined to give you back the final response. This is something you can definitely try yourself. Another nice property is that you don't need exact matches between your vector database and your SQL database, because a lot of this is inferred by the LLM: given any rough relationship between the data in your SQL database and your vector database, you don't need to set up an explicit data model; the LLM can infer it for you. For completeness, this is the final response.

The next example: what if you want to ask a question that doesn't really need the help of the SQL database, where the information can just be found by doing semantic search over your documents? Tell me about the history of Berlin. Remember the selector at the beginning: it's able to detect whether this question should hit the SQL database first or go directly to the vector store. In this example, notice that the first step is actually just
going directly to the vector store, where we fetch the relevant context for answering the question. Great, so this is the final response.

Finally, let's ask a question that only requires the SQL database and doesn't need the vector database at all. Remember, in our diagram, once we run text-to-SQL against the SQL database, we have an additional query transformation or processing step that decides: should we translate this into a question to hit the vector database with, or are we done, because the question has already been answered? Here we ask: can you give me the country corresponding to each city? Notice that we query the SQL database, the SQL query is SELECT city_name, country FROM city_stats, and we get back a response: Toronto is in Canada, Tokyo is in Japan, Berlin is in Germany. And notice how there's no follow-up query to the vector database, because the question has already been answered. Great, and that's it! I'd definitely encourage you to try out the notebook; it's linked in the YouTube description and also in the docs. If you have any questions, please hop on our Discord and feel free to shoot us a note. Thank you!
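To tie the walkthrough together, the whole joint flow can be mocked in a few dozen lines: a selector routes the question, a SQL step answers the structured part, a query transform rewrites the question, and a vector step supplies the unstructured context. Every component below is a hand-written stand-in for the LLM-driven modules in LlamaIndex (the routing and transformation are keyword checks, the "vector store" is a dict), and the data is the sample from the notebook.

```python
import sqlite3

# Toy structured side: the city_stats table from the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_stats (city_name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?)",
    [("Toronto", 2930000), ("Tokyo", 13960000), ("Berlin", 3645000)],
)

# Toy unstructured side: a dict standing in for the vector store of articles.
WIKI = {
    "Tokyo": "Tokyo's arts scene includes major festivals and galleries.",
    "Berlin": "Berlin has a long history of museums.",
}

def select_route(question: str) -> str:
    # Stand-in for the LLM selector: structured stats go to SQL first.
    return "sql" if "highest population" in question else "vector"

def answer(question: str):
    """Run the joint flow; returns (final_answer, list_of_intermediate_steps)."""
    steps = []
    if select_route(question) == "sql":
        city = conn.execute(
            "SELECT city_name FROM city_stats ORDER BY population DESC LIMIT 1"
        ).fetchone()[0]
        steps.append(f"SQL response: {city}")
        if "arts and culture" not in question:
            return city, steps  # SQL alone answers the question; stop here
        # Stand-in for the LLM query transform: refine with the SQL result.
        question = f"Tell me about the arts and culture of {city}"
    # Vector step: look up the (mock) article for whichever city is named.
    city = next(c for c in WIKI if c in question)
    steps.append(f"Vector response: {WIKI[city]}")
    return WIKI[city], steps

final, trace = answer(
    "Tell me about the arts and culture of the city with the highest population"
)
```

The three notebook queries map onto the three paths through `answer`: SQL then vector, vector only, and SQL only with an early return.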
Info
Channel: LlamaIndex
Views: 7,587
Id: ZIvcVJGtCrY
Length: 20min 8sec (1208 seconds)
Published: Mon Jun 05 2023