Extract Topics From Video/Audio With LLMs (Topic Modeling w/ LangChain)

Captions
I guarantee you have a use case for topic modeling, and if you don't, well, you can help the millions of people who do. Topic modeling is the art of extracting groups of information from a longer body of text or a series of documents. You know those chapters you sometimes see on YouTube videos? That's likely someone doing mental topic modeling: reviewing the entire video and labeling the segments they deem important. The same thing goes on for podcasts.

Where's the opportunity? It takes a lot of manual work to go through an entire podcast or video to extract those segments, and that structured data is really valuable to the right person. If you could find a buyer for this, you could create a productionized service for YouTube videos, podcasts, meeting notes, legal documents, movie scripts, books, lecture notes, and many more. For example, if we check out the Acquired podcast website, we see that they don't actually have topics listed on their episodes. It'd be pretty awesome if you ran this exercise, gave them a couple of episodes' worth of topics, and then said, "Hey, here's the price if you want your full episode list." Then you could rinse and repeat this for other podcasts, videos, or really anything where a series of information is involved.

The emphasis of my tutorials is learning the ins and outs of building with AI. In this tutorial we're going to go through a topic modeling method I used while parsing information from the My First Million podcast. This tutorial was released to community members earlier, so if you want to get notified about new content, make sure to subscribe and sign up for the community in the description.

All right, so today we're going to take a two-pass approach. This is the method I found worked best for my use case, but you may want to experiment with your own. For the first pass, I'm going to run the entire document through map-reduce and pull out the topics as bullet points. We'll be processing every token, so this can get a little expensive as you build more; please keep an eye on your spend. For the second pass, I'm going to iterate through each topic bullet point and expand on it with a subset of context selected via retrieval. That's a long-winded way of saying I don't want to pull out a lot of detailed information in the first pass, because I noticed it was hard for the LLM to give me both the topics and the details at once, so I'm splitting it into two. The second pass will look like question-and-answer retrieval with context, but we'll go over that in a second.

My assumption here is that you do not have a table of contents. If you did, like for a book, a textbook, or a movie, that would be helpful and you'd likely want to use it, but let's assume not, because I want to make this as general as possible. And finally, I assume you want to learn the nuts and bolts of how to do this. You could give a third-party tool access to your data and let it do this for you, but if you want more control over the process, this tutorial will be helpful. Those are the use cases we talked about, and if you want to check out the tweet that started it all, you can go find my post about it.

All right, let's get started. First I'm going to import a bunch of packages. I'm not going to go through each one, but if you have any questions about them, please leave comments down below.

For the setup, I'm going to use two different language models: GPT-3.5 Turbo, the June 13th (0613) snapshot, as well as GPT-4, also the 0613 snapshot. The reason I'm doing this is that some of the tasks we're about to do are fine for GPT-3.5, where we don't need much reasoning power, while for others we want the extra horsepower, so we'll use GPT-4. I like calling these out as llm3 and llm4 so I can remember which one is which.
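Here's a minimal sketch of that setup, assuming a mid-2023 LangChain install (pre-1.0, where ChatOpenAI lives in langchain.chat_models) and an OPENAI_API_KEY in your environment; the exact snapshot strings are my assumption:

from langchain.chat_models import ChatOpenAI

# Two models: GPT-3.5 for the cheaper tasks, GPT-4 where more reasoning helps.
# The 0613 snapshot names are assumed from the June models mentioned above.
llm3 = ChatOpenAI(model_name="gpt-3.5-turbo-0613", temperature=0)
llm4 = ChatOpenAI(model_name="gpt-4-0613", temperature=0)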
Next you're going to need the actual transcript we're going to parse. I've included three different transcripts so you can experiment, but today we're only going through one: the My First Million episode with Steph Smith. Let's load it up and look at a sample to see what we're working with. We have the speaker's name, the timestamp at which a sentence was said, and the actual transcript text. I noticed the transcript isn't 100% reliable, but I'd give it about 97% reliability.

Then we're going to split our transcript, because the full thing is much too long to put in a single prompt. I'll load up my RecursiveCharacterTextSplitter and set my separators, the first being the double newline, because the speaker turns in the transcript are separated by double newlines, so that'll be a good split point for us. For chunk size I'm going to use 10,000 characters, so we get a good amount of information in each chunk. Keep in mind that characters do not equal tokens; at roughly four characters per token, that's about 2,500 tokens per chunk. For the chunk overlap, set it to whatever you want; I usually go with around 10 to 20 percent so I make sure I don't lose any context or information, but again, this is specific to your use case and I encourage you to play around with it.

For this exercise I'm only going to use a subset of the transcript, about the first 23,000 characters. The reason is that this transcript is kind of long and I don't want to run through the whole thing for this tutorial, but I do want to show you that it works. I'll do that by loading the transcript from above and taking only the first 23,000 characters. After splitting, we end up with three documents, and the first one is about 2,800 tokens.
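A sketch of that splitting step, assuming the raw transcript has been read into a string called transcript (the variable name is mine):

from langchain.text_splitter import RecursiveCharacterTextSplitter

subset = transcript[:23000]  # only the first ~23,000 characters for the demo

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " "],  # speaker turns are split by blank lines
    chunk_size=10000,                # characters, roughly 2,500 tokens
    chunk_overlap=2000,              # ~20% overlap; tune for your use case
)
docs = text_splitter.create_documents([subset])
print(len(docs))  # 3 documents for this subset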
Next up, we're going to extract the topic titles with a short description. What I want to do is run a map-reduce: process each of those three chunks and ask, "Hey GPT, what topics do you see within this podcast transcript?" The important part here is that we're going to write a custom prompt. The reason is that the topics I care about for this specific domain are nuanced, and I want to give the language model further instructions. The out-of-the-box prompts from LangChain do a good job for generic use cases, but I really want to hone in on mine. This is going to be a point of differentiation for your own product, so when you do this, I highly suggest you customize it for your domain.

I'll walk through this first prompt so you can see it and I can talk about my thought process a bit more. "You are a helpful assistant that helps retrieve topics talked about in a podcast transcript. Your goal is to extract the topic names and a brief one-sentence description of the topic." I don't want it to give too many details and overload itself; I just want a brief reference. Then "topics include...", followed by a long list of topics that could be interesting to pull out of the My First Million podcast; they talk about a lot of things, including business ideas, interesting stories, ways to make money, et cetera. I want it to know that this is what I mean by "topics", not its general definition of what it thinks topics are. Then I go through some formatting instructions: an example topic, then a colon, then a brief description; do not respond with numbers, just bullet points. I arrived at a lot of these through iteration, and there's no one-size-fits-all solution, so you'll have to iterate and add your own. Then I gave it a bunch of examples that I pulled out manually so it would know the types of topics I'm looking for and the kind of language I want it to use: "Sam's Elizabeth Murdoch story: Sam got a call from Elizabeth Murdoch when he had just launched The Hustle," et cetera. I wrap that up in a system message prompt, and for the human template I say, "Here's the transcript," where the text placeholder will receive the chunk text we saw above, so each of the three chunks gets placed in there. Then we wrap both of these up into a chat prompt template that holds both messages. Let's run that; that was our map prompt.

Now for the combine prompt. The results, those bullet points we get back from the map step, are going to contain some duplicates, so I want it to consolidate them, and the way I do that is with a combine prompt. The emphasis here is: deduplicate any bullet points you see, only pull topics from the transcript, and don't use any of the examples. We include some examples again and wrap it all up in a combined chat prompt.

For the first pass we're actually going to run these, and the way I'll do it is with the load_summarize_chain. Now, you may be wondering, "Greg, we aren't really generating a summary," and you're correct. However, I really like load_summarize_chain just to hijack the map-reduce chain it provides; I found the out-of-the-box map-reduce chain from LangChain still a bit complicated, and this one is just super easy for me. The language model we'll use here is GPT-4, because we want the extra reasoning power to understand what the important topics are. We'll use the map_reduce chain type, pass our map prompt, which handles the first pass, and our combine prompt, which handles the consolidation. I've commented out verbose=True, but you can turn it on if you want. We load up the chain, and once we run this cell, this is what actually does the work, so I'll skip ahead while it runs. All right, we found some topics; let's see what we have.
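Putting the two prompts and the hijacked summarize chain together might look like this; the prompt text here is a heavily abbreviated stand-in for the full domain-specific prompts described above:

from langchain.chains.summarize import load_summarize_chain
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

# Abbreviated map prompt: one bullet per topic, name plus short description.
map_template = """You are a helpful assistant that extracts topics talked
about in a podcast transcript. Return one bullet per topic in the form
'- Topic name: brief one-sentence description'. Do not use numbers."""
map_prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(map_template),
    HumanMessagePromptTemplate.from_template("Transcript: {text}"),
])

# Abbreviated combine prompt: deduplicate the bullets from the map step.
combine_template = """You will be given bullet points of topics found in
chunks of a transcript. Deduplicate them, and only keep topics that come
from the transcript itself, not from any examples."""
combine_prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(combine_template),
    HumanMessagePromptTemplate.from_template("{text}"),
])

# "Hijack" the map-reduce summarize chain to extract topics instead.
chain = load_summarize_chain(
    llm4,                      # GPT-4 for the extra reasoning power
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
    # verbose=True,
)
topics_found = chain.run(docs)
print(topics_found)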
Let's go through these. I'll print the topics, and now we have a bunch of bullet points. "Children's play space business idea: Sean discussed a concept of a membership-based children's play space but clarified he doesn't endorse it." Interesting. "Steph Smith's career journey: Sam Parr shared how Steph Smith joined Trends and later moved to Andreessen Horowitz." Awesome. So we have a bunch of topics pulled out of this podcast, which is pretty cool, because now we can start to make structured data out of the unstructured text that sits within the podcast itself.

The next step is to convert this big long string that was returned from the language model into structured data, so we can use it elsewhere more easily, and the way we're going to do that is with the new function calling functionality from OpenAI. I'll define a schema with a few different properties. The first is a string whose description is "the title of the topic listed"; here I just want it to extract the topic name. The next one is the description, the one we see within the text right here; you can think of this as another extraction target, since I'm pulling more text out of the result. And the last one is what I'm calling a tag. This is a string again, and its description is "the type of content being described", because if you look at the results we have some business ideas, some life advice, some life hacks, and so on, and we want to know whether each topic is a business model, life advice, health and wellness, or one of the stories. Let's run through this and see what our structured topics look like.

All right, we got structured data back: we have our topic name, "children's play space business idea", we have a description, and the tag is "business models", because we're talking about a business idea right here, which is really cool. The reason I'm showing this is that the structured data piece is super important: the more structure you can give the output, the more valuable it's going to be to somebody else, and the less work they have to do. So in addition to the tag, I encourage you to see what other ways you can qualify the data and make it more structured for somebody else.
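A sketch of the extraction step using LangChain's function-calling helper, create_extraction_chain, which wraps OpenAI function calling; the exact property names and tag values are my assumptions based on the description above:

from langchain.chains import create_extraction_chain

# JSON-schema-style properties passed to the OpenAI function definition.
schema = {
    "properties": {
        "topic_name": {
            "type": "string",
            "description": "The title of the topic listed",
        },
        "description": {
            "type": "string",
            "description": "The brief description of the topic listed",
        },
        "tag": {
            "type": "string",
            "description": "The type of content being described",
            "enum": ["Business Models", "Life Advice",
                     "Health & Wellness", "Stories"],
        },
    },
    "required": ["topic_name", "description", "tag"],
}

extraction_chain = create_extraction_chain(schema, llm3)
structured_topics = extraction_chain.run(topics_found)
print(structured_topics)  # a list of dicts: topic_name / description / tag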
Next up we're going to move on to step two: expanding on the topics we found. We have a topic name and a short description, but what if you want a longer description, a longer summary, or to transform this into something completely different altogether? The way we're going to do that is with a retrieval method; we're going to do the vector store dance. What that means is we're going to chunk up our transcript one more time, but into smaller documents. We want to generate a summary for a given topic, but we don't want to run over the full transcript again, because that's a lot of tokens; I only want the chunks that are relevant to the topic at hand as the context for generating more detail. When I first came across this problem, I thought to myself, "Man, that sounds a lot like question answering, where you do a similarity search over your embeddings and your chunks," so I decided to apply the same method here.

All right, let's jump in and see what it looks like. We use our RecursiveCharacterTextSplitter again, but the chunk size is going to be 4,000 characters, about half the 10,000 we used above, with a chunk overlap of about 20 percent, or 800 characters. For the docs, we split the regular transcript, still the subsection we were looking at before, and we end up with eight docs instead of the three from above; roughly, they were cut in half and then some. We create our embeddings engine, and we're just going to use OpenAI embeddings, but you can use whatever embeddings engine you want here. For the vector store we're going to use Pinecone. I had a heck of a time trying to get Chroma and FAISS to work before this, so they weren't working for me; I wouldn't normally recommend a remote vector store for this light of a use case, so substitute whatever vector store you want. We initialize Pinecone, and my index name is going to be "topic-modeling", which I created online. Then we create our docsearch from here; this builds the index entries for us and pushes them up to the cloud, so this is no longer local. If you ever want to delete your vectors within Pinecone in an easy way, you can reference your index and call index.delete(delete_all=True); this just resets things in case you want to practice again. We'll skip over that.
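A sketch of the retrieval setup, assuming the 2023-era pinecone-client v2 API and an index created ahead of time in the Pinecone console; the API key and environment values are placeholders:

import pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone

# Smaller chunks for retrieval: 4,000 characters with 800 (20%) overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=800)
docs = text_splitter.create_documents([subset])  # `subset` from earlier
print(len(docs))  # 8 smaller documents

embeddings = OpenAIEmbeddings()

# Assumes an index named "topic-modeling" already exists in the console.
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_ENVIRONMENT")
docsearch = Pinecone.from_documents(docs, embeddings, index_name="topic-modeling")

# To wipe the index and practice again:
# pinecone.Index("topic-modeling").delete(delete_all=True)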
Now we're going to write another custom prompt. The reason is that I want the similarity search from retrieval, but I don't want it to just answer a question per se; I want it to write a slightly longer summary. "You will be given text from a podcast transcript which contains many topics" (because our retrieved chunks won't contain just the one topic we want; there could be more) "and your goal is to write a summary, five sentences or less, about a topic the user chooses. Do not respond with information that isn't relevant to the topic the user gives you." Then we receive the context: this placeholder will be replaced with the chunks of the transcript that are supposed to be relevant to what the user asks. Now, the human message right here would normally be the question the user asks; instead, we're going to place the topic title and topic description there, because that's what I want similar documents to be found from, not a question. I left the variable named "question" just because that's the default input key LangChain uses. We put both messages into a chat prompt template.

Then we set up our RetrievalQA. This is the retrieval part, and it is nominally question answering, but since we made our custom prompts up here, it's not so much question answering as, well, custom retrieval, whatever you want to call it. With from_chain_type we pass in the docsearch we have and our chain type keyword arguments; set verbose=True if you want to see all the magic happen in the background, and here's the cool part: we pass in the custom chat prompt we made above.

Then we run through it. For each topic in the structured topics we had above, the dictionary from before, I was going to look at only the first three, but tell you what, let's do five. The query in this case is the topic name plus the topic description; this is where the question would normally go, but I want to find similar documents based on this query rather than a question, which is why I set it up this way. The expanded topic, so called because we're asking for the five-sentence summary from above, is what we get back from qa.run with our query. Let's iterate through and see what we get; I'll print the topic name and description, then the expanded topic.

If we look at the first one, the hearing aids business, that's the first topic we talked about above: "Sean and Sam discuss the potential profitability of the hearing aid business." If we scroll down, we start to get more information on it: "Sean and Sam discuss the potential of the hearing aid business, noting that it could be a profitable venture; they believe direct-to-consumer hearing aids could be a significant market." Cool, so now we've started to expand on the information we have. We also have the children's play space business: here's the description we had before, and now we have a longer one. My point in showing you this is that a summary isn't bad, but you may have your own use cases for the extra information you want to pull out, and this is a convenient way to do it. We roll through, and there's more information about more topics.
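Here's roughly what the custom prompt, the RetrievalQA chain, and the expansion loop look like together; the prompt is abbreviated, the model choice for this step is an assumption, and structured_topics is the list of dicts from the extraction sketch:

from langchain.chains import RetrievalQA
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

# {context} receives the retrieved chunks; {question} receives our
# topic name + description instead of an actual question.
system_template = """You will be given text from a podcast transcript which
contains many topics. Your goal is to write a summary, five sentences or
less, of the topic the user gives you. Do not respond with information that
isn't relevant to that topic.
----------------
{context}"""
chat_prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
])

qa = RetrievalQA.from_chain_type(
    llm=llm3,  # model choice for this step is an assumption
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    chain_type_kwargs={"prompt": chat_prompt},  # add "verbose": True to watch
)

# Expand the first five structured topics.
for topic in structured_topics[:5]:
    query = f"{topic['topic_name']}: {topic['description']}"
    expanded_topic = qa.run(query)
    print(f"{topic['topic_name']}: {topic['description']}")
    print(expanded_topic, "\n")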
Now, that's the end of our regularly scheduled programming, but let's also do chapters with timestamps. There will be a lot of instances where you actually have the timestamps for your transcript and you want to pull out the different chapters: it could be a YouTube video, a podcast transcript, or whatever. We're going to do something extremely similar, with the same method. We set up another custom prompt: "What is the first timestamp when the speakers started talking about a topic that the user gives? Only respond with the timestamp, nothing else," and then I give it an example timestamp. Just to remind you what the data looks like: if I print the transcript one more time, it's nice and long, and we have the timestamps alongside what each person says. So we're saying, "Hey language model, here's where somebody's talking about a topic; what's the earliest timestamp you see here?" I tried this a bunch of different ways, some using the function calling OpenAI just came out with, but really the easiest was to just ask, and it's pretty good at finding the answer.

So here's our RetrievalQA again, this time with the new custom prompt. For the topic timestamps, I start with a placeholder list, then go through each topic in the structured topics from above, which I think was something like 10 or 15. The query is the topic name, and the timestamp is the output of this retrieval process. Then I append the timestamp that was found, the answer from the language model, along with the topic name, into the list, join it into a larger list, and sort it so we can see them all in order (a sketch of this loop follows the transcript below).

Awesome, so what we have now are the timestamps that were returned. For the very first topic, the hearing aids business (and this is just the topic name, remember, not the description), the language model is telling us this was the first thing that was talked about, because it basically opened the episode. The next topics were the children's play space business, a health and diet hack, and Steph Smith's career. Let's go check this out and see what the first topic is. I'm over here on steno.ai, and I'll play the first topic: "D2C hearing aids, I think that's actually going to be a big deal." Pretty cool; that was indeed the first thing that was talked about, which is awesome.

All right my friend, that's my take on how to pull the chapters and topics out of a transcript or other piece of text. I'm super curious to see what you do with this project yourself, so when you apply it to your own domain, please let me know. I love getting emails, comments, tweets, whatever it may be; I just want to see you build and see what you end up doing. Thank you very much for joining today, and we'll see you later. Bye.
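For reference, here is a minimal sketch of the timestamp pass described in the walkthrough above, reusing the docsearch and structured_topics objects from the earlier sketches; the prompt wording, the example timestamp format, and the choice of model are my assumptions:

from langchain.chains import RetrievalQA
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

# Same retrieval pattern, different prompt: ask only for a timestamp.
system_template = """What is the first timestamp at which the speakers
started talking about the topic the user gives you? Only respond with a
timestamp, nothing else. Example: 0:18:24
----------------
{context}"""
timestamp_prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
])

qa = RetrievalQA.from_chain_type(
    llm=llm4,  # model choice for this step is an assumption
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    chain_type_kwargs={"prompt": timestamp_prompt},
)

topic_timestamps = []
for topic in structured_topics:
    timestamp = qa.run(topic["topic_name"])
    topic_timestamps.append(f"{timestamp} - {topic['topic_name']}")

# Sort so the chapters appear in the order they were discussed.
print("\n".join(sorted(topic_timestamps)))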
Info
Channel: Greg Kamradt (Data Indy)
Views: 11,438
Id: pEkxRQFNAs4
Length: 17min 34sec (1054 seconds)
Published: Wed Jun 21 2023