Building a Chatbot with ChatGPT API and Reddit Data

Video Statistics and Information

Captions
The ChatGPT API was released just a few days ago at the time of making this video. It's the same model underlying the ChatGPT product, while being 10 times cheaper than the other existing GPT-3.5 models. Tools like ChatGPT are trained on a lot of different data sources on the internet, some of which we don't even know. But what if we want to build a little chatbot that only uses your own custom data source, something that you trust, like Reddit threads? So in this video we'll be building an ask-me-anything chatbot that answers all your questions about data science, machine learning, and AI, based on the content from some popular subreddits in this area. By the end of this video, we'll be deploying an interactive notebook with the chatbot that you can share with other people via a simple link. With this chatbot, we can truly harness the power of Reddit wisdom.

We'll go through a few different steps. In the first half of the video, we'll retrieve the Reddit posts and comments related to data science using the Reddit API. Using this data, we'll first do some exploration: finding out the trending data science topics on Reddit, and what the sentiments and even people's emotions are around topics such as ChatGPT, Stable Diffusion, etc. In the second half of the video, we'll take on the main challenge, which is to build a chatbot using the ChatGPT API and this Reddit dataset. We'll be developing this project using Datalore, a collaborative data science platform from JetBrains, who has kindly sponsored this video. Without further ado, let's get started.

Okay, first things first, we need the data for this project. I chose to collect data from Reddit because it has a great free API, while Twitter's API is unfortunately no longer free. Reddit is actually most popular in the US, so in case you're not familiar with it, Reddit provides a public forum for communities with similar interests to discuss and exchange ideas. This was my first time collecting Reddit data, so just like any well-versed data scientist nowadays, I went on ChatGPT to ask for the source code. It suggested that I use the Python Reddit API Wrapper (PRAW) library. I followed all the steps and got pretty pumped up that this was gonna be easy peasy. Unfortunately, the code didn't work; it seemed to be using an old method from this API that no longer exists. So okay, here goes the same old research: browsing Stack Overflow and digging through the API documentation. The Reddit API is pretty cool; it allows you to get the most up-to-date information. The only catch, however, is that the maximum number of listings you can pull each time is 1,000. There are some other third-party APIs, such as the Pushshift API; unfortunately, I tried it and found that it is not very stable. In the end, I went for the Reddit API, which I think should be sufficient for our project.

Alright, let's go to the Datalore website and create an account. I'll sign in using my Gmail, and you can choose the free tier here if you just want to try it out. Remember that you can also do this project on your local computer; the difference is that it's a lot more convenient on Datalore, and there are many useful features that can help you get something up and running very quickly. You can perform a complete data science pipeline on Datalore, from querying data, EDA, and model training, to presenting results to stakeholders as interactive reports or data apps. In addition to that, you can collaborate in real time on the code with friends and colleagues, and even schedule notebooks to run and update the reports automatically.
A unique feature of Datalore compared to other online platforms is that if you're working with sensitive data, your team can host a private version, Datalore Enterprise, on AWS, Google Cloud Platform, Azure, or on-premises. This way your data doesn't need to leave your company's environment. Okay, after you've signed in, we get to the project space on Datalore. You can take a look at some sample notebooks here and play around with them to get a feeling for how things work. Now let's create a new notebook and call it "get reddit data". We can choose the kernel here; we're using Python. You can also choose between different machines. I'm using the Professional plan, so there are some crazy large compute options here, which could be very useful if you're working on a large project or if you just want to speed up the computation.

We first need to install the Python Reddit API Wrapper library. You can use pip install as usual, but on Datalore you can simply go to the environment manager, search for the package, and click the icon to install it. Unlike other APIs, the Reddit API is somewhat difficult to work with at the beginning, in the sense that you have to do some extra things. First, we need to go to the Reddit app preferences page, scroll down to the bottom of the page, and click the "are you a developer? create an app" button. Now we have a few fields to fill out. We'll select "script", and for the name you can fill in whatever you feel like; in this case I'll just fill in "get data". The description is not required, the about URL is also not required, and finally for the redirect URL you can use http://localhost:8080 to reference your local machine. After hitting submit, you get to a screen which includes your client ID on the top left and your secret key in the middle. You'll need to use these in a moment.

After we're done with this weird setup step, here comes the good part: getting the data. To retrieve Reddit data, we need to create a Reddit instance that takes the client ID and secret that Reddit assigned to you earlier, the redirect URL being the localhost URL, and the user agent being your Reddit username; in my case it's hopeful_contribution_4. A useful thing to know is that on Datalore you can also hide your secret keys, so that other people who view your notebook can't see them, such as your API keys. You can do this by going to your account settings; under the Secrets tab you can create your own secret and then attach it to your current notebook, like so. After you've restarted the kernel, you should be able to access the value of your secret key using os.environ. I find it really handy for making sure your keys are safe. So for now, please don't abuse my API key.

Okay, now moving on to the first task: we'll pull the posts, or submissions as they are called on Reddit, from three popular subreddits: MachineLearning, artificial, and datascience. It's actually quite simple. I'll just take the example from the API documentation and copy it here. This piece of code basically pulls the top submissions of all time from the redditdev and learnpython subreddits. There are also other methods, such as "hot" and "new", to pull the hot and new topics, which can also be interesting depending on your goals, and we can also set the time filter to the last hour, last day, last week, or the entire history. We'll modify this code a little bit to make it pull the data from the three subreddits that we're looking into, and we'll set the limit to the maximum, which is 1,000 per subreddit.
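Roughly, the setup looks like the sketch below. The client ID value and the environment variable name (REDDIT_CLIENT_SECRET) are placeholders of mine, not the actual ones from the video:

```python
import os

import praw

# Client ID comes from the Reddit app page; the secret is stored as a
# Datalore secret and read from an environment variable. The variable name
# REDDIT_CLIENT_SECRET is just a placeholder here.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    redirect_uri="http://localhost:8080",
    user_agent="your_reddit_username",
)
```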
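And a rough sketch of the submission-pulling step, assuming we keep a handful of attributes per post; the exact attribute list and column names in the video may differ:

```python
import pandas as pd

SUBREDDITS = ["MachineLearning", "artificial", "datascience"]

def get_top_posts(reddit, subreddits=SUBREDDITS, limit=1000):
    """Pull the top submissions of all time for each subreddit into a DataFrame."""
    rows = []
    for name in subreddits:
        for submission in reddit.subreddit(name).top(time_filter="all", limit=limit):
            rows.append({
                "post_id": submission.id,
                "subreddit": submission.subreddit.display_name,
                "title": submission.title,
                "selftext": submission.selftext,
                "url": submission.url,
                "num_comments": submission.num_comments,
                "score": submission.score,
                "created_utc": submission.created_utc,
            })
    return pd.DataFrame(rows)

post_df = get_top_posts(reddit)
post_df.to_csv("reddit_posts.csv", index=False)
```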
After a minute or so, this is done running. If we take a look at the output, you can see that it's a list of Submission objects. In the Reddit API documentation you can find all the attributes of this object; we have quite some information here, such as the post ID, post title, URL, and the created date of the post. I'll go ahead and select a few attributes that I find most useful, and we'll append all the post data into a DataFrame. We can also set the total limit to 3,000 because we have three subreddits here. And here we go: in this post dataset we have about 3,000 posts, which makes sense because we queried 1,000 posts for each of these subreddits. Let me quickly convert this into a function to make it cleaner, and also write the post data to a CSV.

In the next step, we're going to use these post IDs to retrieve the comments from those posts. The Reddit API also has a method for this. You can choose to only retrieve the top-level comments, but in this case I think it'll be more interesting to retrieve all the comments from a post. So we copy this code and try it for one random post ID; you can see that it's working, which is great. Now, to get the comments for all the posts that we have, we'll initialize an empty comment list, create a for loop that loops through all the post IDs in the post DataFrame, and copy-paste the code into this loop. For each comment retrieved, we append the post ID and the comment text to the list, and finally we convert this list into a DataFrame and save it to a CSV file. I won't run this cell again now, because last time it took me almost three hours to pull 200,000 comments, so let's just pretend it's all done and the CSV file has been saved. If you go to the notebook's Files menu, you can see we have two CSV files here: one for the posts and one for the comments. You can double-click on a file to view it in a larger window, which is pretty nice.

Now that we have all the data we need, let's move on to the next part of the project. Just to separate this part from the data retrieval part, we'll create a new notebook called "Reddit EDA and chatbot". We first import some packages, like pandas, datetime, matplotlib.pyplot, and seaborn, that we need for data visualization later, and we also import the transformers package from Hugging Face that we'll use for sentiment analysis later. Now we'll go ahead and import the post dataset and the comments dataset that we've just created with the Reddit API. Oops, I forgot that we have a different notebook now, so I just downloaded the data files from the other notebook and uploaded them here. There are a couple of small things we want to do before moving on to further analysis. Looking at the post dataset, you can see that the created_utc column is in timestamp format instead of date format. This is not so useful, so I'll go ahead and create a date column that converts the value from the timestamp format to the datetime format. We'll also create a column to store the year of the posting date; this will be useful for analyzing trends over the years in a bit. Now we can view the dataset and see that we have the normal datetime format here. Next, we want to merge the posts and the comments together by the post IDs, so that we know which comments belong to which post. That's it.
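For reference, the comment-pulling loop might look something like this sketch, reusing the `reddit` instance and `post_df` from the earlier snippets. PRAW's `replace_more` call is what expands the full comment tree, and it is also why this step is so slow:

```python
import pandas as pd

def get_all_comments(reddit, post_df):
    """Retrieve every comment (not just top-level ones) for each post."""
    rows = []
    for post_id in post_df["post_id"]:
        submission = reddit.submission(id=post_id)
        # Expand all "load more comments" placeholders -- this is the slow part.
        submission.comments.replace_more(limit=None)
        for comment in submission.comments.list():
            rows.append({"post_id": post_id, "comment": comment.body})
    return pd.DataFrame(rows)

comment_df = get_all_comments(reddit, post_df)
comment_df.to_csv("reddit_comments.csv", index=False)
```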
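And a small sketch of the cleanup and merge steps in the second notebook, again assuming the column names introduced in the earlier snippets:

```python
import pandas as pd

post_df = pd.read_csv("reddit_posts.csv")
comment_df = pd.read_csv("reddit_comments.csv")

# Convert the Unix timestamp into a proper datetime and extract the year.
post_df["created_date"] = pd.to_datetime(post_df["created_utc"], unit="s")
post_df["created_year"] = post_df["created_date"].dt.year

# Attach the post information to every comment via the post ID.
comment_post_df = comment_df.merge(post_df, on="post_id", how="left")
```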
Now it's time to do some more exciting stuff. Let's first take a closer look at the post dataset and see what's going on. In a Datalore notebook you can easily explore data: you can do simple sorting and filtering by column, for example to find out which post has the highest number of comments. You can also take a look at the Statistics tab over here, which is really handy: you can see the data distribution and the descriptive statistics for each column. From my experience, this can save a lot of time when getting to know the data. There are also some little graphs here to show you the data distribution. Another fun thing is the Visualization tab, where you can plot some graphs. One thing I want to know is how these 3,000 posts spread out over the years. We can make a bar chart with the x-axis being the created year and the y-axis being the count of posts, and for the color we can pick a single color or choose a column to color by, for example the subreddit. We can see that our 3,000 top posts are mostly from recent years, which makes sense because data science and AI are quite hot topics these days. There are also slightly more top posts in the data science subreddit than in the machine learning or artificial intelligence subreddits. I think this visualization is quite nice, so we can export it to a chart cell or as a code cell; with the code cell you can actually see the code underlying the chart, which is very cool.

So we've gotten to know the data a little. The next thing we want to find out is: what are the trending terms or topics people have talked about over the past years? We can get a feeling for this by plotting a word cloud. I'll just combine all the post titles together into a post_title_text variable, then create a word cloud based on the words appearing in those post titles. We can set the collocation threshold to two so that we capture collocations, i.e. compound terms of two words, like "machine learning", "data science", "deep learning", etc. It's no surprise that terms like data science, machine learning, deep learning, and AI are mentioned very often in those post titles.

Now I'm also curious: what if we look at the word cloud by year? How would the trending topics from five years ago differ from the trending topics today? We can do this interactively. So far we've only been using the code cells and the markdown cells, but you can also create many different input cells, such as dropdowns, sliders, text inputs, buttons, etc. For the years, I think it's nice to use a slider, so I'll go ahead and choose a slider, set its label and the variable underlying it, and select a minimum value of 2014 and a maximum value of this year, which is 2023. Then we'll just subset the posts where the created year equals the selected year, and run the word cloud on this subset of post titles. You can see that in 2014 there was really not much: probably some people talking about Yann LeCun's paper and some neural network stuff, but really not much. Fast forward to 2022 and we have a lot of posts about machine learning, and also Stable Diffusion and DALL-E 2, which were really the highlights of last year. Unfortunately, we don't see ChatGPT here for some reason; I do see some very small words like ChatGPT, but it's really not much compared to other topics.
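The word cloud step could look roughly like the sketch below, using the wordcloud package; the figure size and the exact preprocessing are my assumptions, not necessarily what's in the video:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_title_word_cloud(titles):
    """Build and display a word cloud from a collection of post titles."""
    text = " ".join(str(t) for t in titles)
    wc = WordCloud(
        width=800,
        height=400,
        background_color="white",
        collocation_threshold=2,  # keep two-word terms like "machine learning"
    ).generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()

# All years combined, then a single year (in the notebook, selected_year
# would come from the slider input cell).
plot_title_word_cloud(post_df["title"])
selected_year = 2022
plot_title_word_cloud(post_df.loc[post_df["created_year"] == selected_year, "title"])
```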
Since people, maybe like you and I, have been really excited about the advancements of AI in the past few months, I think it would be quite cool to analyze the comments related to ChatGPT or Stable Diffusion and see how positive or negative they are. There are a lot of sentiment analysis models that are already pre-trained, so we don't need to train them ourselves. I found some quite nice models on Hugging Face. I'll use a model that was trained on a Twitter dataset, because that seems to be a similar type of text data to what we have with Reddit. I'll go ahead and load the sentiment classification model using the transformers pipeline module. It takes a few moments to load, and then we can try it out. For example, if we say "I love you", the model classifies this as almost 100% positive, which is correct. How about "I don't love you"? Okay, this is negative, so it's working. You can play around with this to get a feeling for how accurate the model is. Then we just create a function that uses this model to classify a text and output its sentiment. Sometimes you might run into errors when running the model, I think probably because the text is too long. I didn't manage to look into this issue, so I just did a quick-and-dirty try/except here: if we get an error, I assign the sentiment as "not classified". Now we want to filter the merged comment–post DataFrame to rows where the post title contains a word, for example "ChatGPT". So we have a bunch of comments here, and we can create a sentiment column to classify them; we can simply use a lambda function to get the sentiment for each comment. Then I'll just do a little visualization to see the distribution of the sentiments for the ChatGPT comments. A lot of them are actually neutral, but we have quite a few negative comments and a small number of positive ones. I think a lot of the negative comments are probably just people debating and getting angry at each other.

Now we'll go a step further: we want to recognize emotions in those comments. On Hugging Face I also found a pre-trained model for recognizing emotion in text, so we load this model and try some text. For example, if I say "ice cream is delicious", it gets a very high score on joy, which is most likely correct. Similarly, we'll create a get_emotion function that takes a text as input, runs it through the emotion classifier, sorts the prediction scores from highest to lowest, and returns the label with the highest score. With this function we can also create an emotion column that classifies the emotion of the comments. If we take a look at the distribution of the emotions, we can see that a lot of comments are actually joy, and quite a few are anger and sadness. It might also just be that these are the most common human emotions. To compare the emotions towards ChatGPT with the emotions towards Stable Diffusion, for example, we can also do it interactively: you can create a text input box here and let users type in the text, so later, in the interactive report, users can choose different words and topics to explore their sentiments and emotions.
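A rough sketch of the sentiment step, assuming a Twitter-trained checkpoint from the Hugging Face hub; the exact model used in the video isn't named, so cardiffnlp/twitter-roberta-base-sentiment-latest here is an assumption:

```python
from transformers import pipeline

sentiment_classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

def get_sentiment(text):
    """Return the predicted sentiment label, or 'not classified' on errors
    (e.g. comments longer than the model's input limit)."""
    try:
        return sentiment_classifier(str(text))[0]["label"]
    except Exception:
        return "not classified"

# Comments on posts whose title mentions ChatGPT.
chatgpt_comments = comment_post_df[
    comment_post_df["title"].str.contains("chatgpt", case=False, na=False)
].copy()
chatgpt_comments["sentiment"] = chatgpt_comments["comment"].apply(
    lambda x: get_sentiment(x)
)
chatgpt_comments["sentiment"].value_counts().plot(kind="bar")
```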
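And a similar sketch for the emotion step; again the checkpoint (j-hartmann/emotion-english-distilroberta-base) is my assumption, not necessarily the one used in the video. With the default settings the pipeline already returns the highest-scoring label, which is equivalent to sorting all scores and keeping the top one:

```python
from transformers import pipeline

emotion_classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
)

def get_emotion(text):
    """Return the most likely emotion label, or 'not classified' on errors."""
    try:
        return emotion_classifier(str(text))[0]["label"]
    except Exception:
        return "not classified"

chatgpt_comments["emotion"] = chatgpt_comments["comment"].apply(
    lambda x: get_emotion(x)
)
chatgpt_comments["emotion"].value_counts().plot(kind="bar")
```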
The exploratory analysis we just did is nice for general insights, but an even more powerful way to utilize this huge, community-generated dataset is to make it available in a chatbot format. To do this, we can augment a large language model, such as the GPT-3.5 Turbo model that underlies ChatGPT, with our own Reddit data. This is also called in-context learning: we insert context into the input prompt, and that way we can take advantage of the large language model's reasoning capabilities to generate answers to our questions. One simple way to do this with ChatGPT, for example, is to say "the context information is below", paste the context, then say "given the context information and not prior knowledge, answer the question", and insert your question. However, there's a problem with this approach: there's a token limit on the prompt, around 4,000 tokens for ChatGPT and similar language models like Davinci. So to pass in large context data like our Reddit comments, we'll need to use a package called LlamaIndex, which is separate from OpenAI. This package basically helps us do a few things: first, it creates an index of text chunks from the context; when a user asks a question, it finds the most relevant chunks; and finally, it answers the user's question using the large language model we define plus the most relevant chunks found in the context. So in short, the LlamaIndex package can help you bypass the prompt size limit with this indexing method. It's also interesting to know that the package can help you connect your large language model to many different external data sources, for example web pages, Google Docs, Twitter, and so on. This is very handy if you want to use information from these different sources. Now let's first install llama-index and LangChain, another package that helps us interface with large language models, and import some of the necessary modules from these packages. I'll spare you the details for now; we'll see in a bit what these different modules are useful for.

Next, we need to create the context file, which in our case is just a large text file combining all the text from the Reddit posts and comments that we've retrieved. I think it's best to aggregate all the comments for each post and then concatenate the post title, the body text of the post, and the comments together; this way we can sort of preserve the logical order of the text corpus. So we'll first select the three text columns, that is the post titles, the post body texts, and the comments, do a group-by on the post title and the selftext, and concatenate all the comments into one large chunk per post. This is how the aggregated DataFrame looks: it has nearly 3,000 rows, one row per post, with all of its comments. Finally, we create a combined text column by joining the three columns together, concatenate everything into an all_text variable, quickly create a separate folder called text_data, and save the text into a file called all_text_reddit.txt in that folder.

After this is done, the next step is to create a function to construct the index from this text file. This part is actually quite high-level, so I'll just paste the whole thing here. I adapted this code from another project by Dan Shipper, where he built a chatbot from the archive of a well-known newsletter; you can find the link in the description below. I've adjusted the function slightly to use the ChatGPT API, i.e. the GPT-3.5 Turbo model. This function indexes the text corpus from a directory that we pass to it and then saves the index object into a JSON file. You can adjust some of the parameters in this function, such as the number of output tokens (in other words, how long you want the answers to be), the maximum chunk size overlap, and the chunk size limit. Then we define a function that's basically our chatbot: it takes the user's question as a string, loads the index object that we created with the function above, generates the response using this index, and displays both the user's question and the bot's answer.
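The corpus-building step might look roughly like this, reusing the merged comment_post_df and the column names assumed in the earlier sketches:

```python
import os

# Aggregate all comments per post, then join title, post body and comments
# into one chunk of text per post.
text_df = comment_post_df[["title", "selftext", "comment"]].fillna("")
agg_df = text_df.groupby(["title", "selftext"], as_index=False)["comment"].agg(" ".join)
agg_df["combined_text"] = (
    agg_df["title"] + " " + agg_df["selftext"] + " " + agg_df["comment"]
)

all_text = "\n\n".join(agg_df["combined_text"])

os.makedirs("text_data", exist_ok=True)
with open("text_data/all_text_reddit.txt", "w", encoding="utf-8") as f:
    f.write(all_text)
```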
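Below is a sketch of the index construction and the chatbot function, in the spirit of the widely shared Dan Shipper-style snippet adapted to gpt-3.5-turbo. It targets the llama-index and LangChain APIs as they were around the time of the video (early 2023, roughly llama-index 0.4/0.5); both libraries have changed substantially since, and the specific size parameters here are assumptions:

```python
import os

from langchain.chat_models import ChatOpenAI
from llama_index import (
    GPTSimpleVectorIndex,
    LLMPredictor,
    PromptHelper,
    SimpleDirectoryReader,
)

# The OpenAI key must be available, e.g. os.environ["OPENAI_API_KEY"] = "sk-..."

def construct_index(directory_path):
    """Index all text files in directory_path and save the index to disk."""
    max_input_size = 4096      # prompt limit for gpt-3.5-turbo
    num_outputs = 256          # how long the answers may be
    max_chunk_overlap = 20
    chunk_size_limit = 600

    prompt_helper = PromptHelper(
        max_input_size, num_outputs, max_chunk_overlap,
        chunk_size_limit=chunk_size_limit,
    )
    llm_predictor = LLMPredictor(
        llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo",
                       max_tokens=num_outputs)
    )

    documents = SimpleDirectoryReader(directory_path).load_data()
    index = GPTSimpleVectorIndex(
        documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper
    )
    index.save_to_disk("index.json")
    return index

def ask_bot(question):
    """Answer a question using the saved index and print both Q and A."""
    index = GPTSimpleVectorIndex.load_from_disk("index.json")
    response = index.query(question, response_mode="compact")
    print(f"Q: {question}")
    print(f"A: {response.response}")

# Example usage:
# construct_index("text_data")
# ask_bot("How to learn data science?")
```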
Now, to run this construct_index function, we need an OpenAI API key. I bet many of you already have one if you've played around with ChatGPT before; if you don't have one yet, you can go to your OpenAI profile and generate a new API key. We then attach this API key to our environment. We take it as an input because we want to use the API keys of the chatbot's users, not our own; if all of you were using my API key, I'd go broke. But for demonstration purposes, I'll paste my API key here. Now we can run the construct_index function: copy the path to our text_data folder and paste it into the function parameter. This will take a few moments, depending on how large your text corpus is, and after it's done you should see an index.json file created in your project directory. Be careful: this step costs money, so if you have a large text corpus, try to estimate how many tokens it contains and how much you expect it to cost. In my case it cost just a few cents for this project, so it's probably not a big deal, and the ChatGPT API is much cheaper than the other models, so that helps.

Now we'll create a text box for users to input their question, and finally we can run the chatbot. I want to ask a few questions here. First, I'll ask "how to learn data science?", which is what you guys ask me all the time. The bot says that to learn data science, it's important to understand the fundamentals of mathematics, statistics, and computer science, and also to have a strong understanding of data analysis, data visualization, and machine learning. This is pretty good; I'd probably tell you the same. And if I ask "is it hard to learn data science?", the bot says no, it is not hard to learn data science. That would probably give you a lot of comfort after all the hard work.

The final step is to share this project with others. You can simply share the notebook by clicking on Share, choosing whether to give users view or edit access, and then copying the link and sharing it. Another fantastic option is to turn your notebook into an interactive report, where people can play around with the interactive elements, like the slider and text box we just created; once the values of those inputs change, the cells that use those values are automatically recalculated. To do this, click on Build report, select the cells you want to keep in your report, and rearrange them as you wish. When you're done, click on Update report, and in the dropdown select the interactive report instead of the static report. And voilà, that's it; this is how you can create and share an interactive report with others using a Datalore notebook.

There are also tons of cool ideas you can develop based on this project. For example, you could create a chatbot that uses your diary entries to answer questions about yourself and even helps you write your dating profile, or a question-answering chatbot based on an article or on the content from your favorite YouTube channel, like my channel. I hope you enjoyed this project video and got some inspiration on how to build a chatbot using the ChatGPT API. You can find the Datalore notebook link in the description.
If you decide to give Datalore a try, let me know in the comments section what you think about it. Again, if you got value from this video, please smash the like button and subscribe if you haven't already to see more videos like this. Also check out the other project videos on my channel. Thank you for watching, bye!
Info
Channel: Thu Vu data analytics
Views: 22,564
Keywords: data analytics, data science, python, data, tableau, bi, programming, technology, coding, data visualization, python tutorial, data analyst, data scientist, data analysis, power bi, python data anlysis, data nerd, big data, learn to code, business intelligence, how to use r, r data analysis, vscode
Id: EE1Y2enHrcU
Length: 27min 35sec (1655 seconds)
Published: Wed Mar 15 2023