Streamlit Web Scraping Project - to CSV - BeautifulSoup - Pandas

Captions
Hello and welcome to today's Streamlit tutorial. Today we are going to scrape a website about quotes and list the quotes based on a theme. For example, if I choose "life" here you'll see the quotes change, and if I pick another theme like "books" I get quotes about different things. The fun part is that I can also generate a CSV. Let me show you: this is the project directory and there are no CSV files in it yet. If I go back and press Generate, look at this: "It is better to be hated..." by André Gide. Now if I go back to the directory, we have a CSV file. I can open it and you can see the quotes, the authors, and the links. This is awesome.

And it's also very simple. We are going to use the website quotes.toscrape.com, which is built specifically to be scraped, so it's completely legal. On the site you can choose themes like "inspirational" and browse different topics, and that's what we're going to build.

To get started, open Visual Studio Code. I have a file called st.py in this directory (let me close the quotes.csv that's still open so I can delete it). In the terminal, inside this folder (mine happens to be called hacker news), we need to install a couple of things. First, Streamlit: pip install streamlit. For me the requirement is already satisfied, so there's nothing else to do. Next, pip install requests, because we'll be sending requests to that URL and reading the response. Then pip install beautifulsoup4, which we'll use for the scraping itself, and finally pip install pandas, which is what turns our data into a CSV.

Now that everything is installed, let's import it: import streamlit as st (st is just an alias so we don't have to type the full name every time), import pandas as pd (pandas is normally for data analysis, but here we use it to turn a data frame into a CSV), import requests, and from bs4 import BeautifulSoup, with uppercase B and uppercase S.

By the way, I have a full tutorial on Streamlit and how to lay out web pages, so refer to that if you don't know how I'm laying these things out. To start the Streamlit server, type streamlit run followed by the file name in the terminal; for me that's streamlit run st.py. It spins up a local server and opens a blank page made with Streamlit. In the settings menu you can change the theme from dark to light (I'll keep it dark), and I'll enable wide mode and "Run on save" so that whenever I save my source code the changes show up automatically. Now let's start laying out the page.
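Here is a minimal sketch of the setup described so far, assuming the file is named st.py as in the video; the terminal commands are shown as comments:

    # terminal setup from the video (run inside the project folder):
    #   pip install streamlit requests beautifulsoup4 pandas
    #   streamlit run st.py
    import streamlit as st          # UI and layout
    import pandas as pd             # turning scraped rows into a CSV
    import requests                 # sending the HTTP request
    from bs4 import BeautifulSoup   # parsing the returned HTML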
What I want first is the select box we saw at the start, a box the user can pick an option from. Streamlit gives us st.selectbox for that: a method that renders a dropdown. It needs a label, something like "Choose a topic", and the next argument is the list of items: "love", "humor", "life", and let's say "books". If I save this and go back to the browser, there's the select box; super cool, you can choose between these options. But we need to store whatever the user chooses so we can use it later, so we assign the result to a variable called tag. Whatever the user picks ends up in tag.

What am I going to use it for? If you go back to quotes.toscrape.com and choose "love", the address bar shows quotes.toscrape.com/tag/love. So we are going to insert whatever the user picks at the end of that URL. I'll copy the URL, go back, and create another variable called url. It's that string, but we need to replace "love" with whatever the user chose, which is why I use an f-string and put the variable tag inside curly braces. Whatever the user chooses gets inserted there. If I want to see what url is, I can use st.write, another method Streamlit gives us, and pass it url. Save, go back, and whichever topic I pick, like "books" or "humor", the page shows /tag/books or /tag/humor. So now we have the address of the page the user wants.

Next we need to send a request to that page to get a response, because we're trying to grab its contents. Let me remove the st.write(url) line. I'll call the response res, and we send the request with requests.get(url): we request the page at that URL and store whatever comes back. Let's print it with st.write(res). If I save and go back I get Response 200, which means the request was successful; if I pick "life", again 200.
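A short sketch of this step; the label and the topic list are the ones used in the video, and st.write(res) is only there to confirm the 200 response:

    tag = st.selectbox('Choose a topic', ['love', 'humor', 'life', 'books'])
    url = f'https://quotes.toscrape.com/tag/{tag}'   # chosen topic inserted into the URL
    res = requests.get(url)                          # send the request
    st.write(res)                                    # shows <Response [200]> on success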
So it's a successful response: our request has been granted and we have access to the resources on the website, but we don't have the content yet. For that we need BeautifulSoup. Let's create a variable called content to hold the contents of the page, using the BeautifulSoup class: we pass it res.content, the body of the response we got, and because it's an HTML page we also pass "html.parser" as the second argument. Now we actually have the HTML content of the page. Let's check it by writing st.code(content); this time I want it displayed as code, which is nicer. Save it, and look at this awesome HTML document: this is the content BeautifulSoup pulled from that address. If I inspect the live page in the browser, I can see the same HTML there.

So we have access to everything on the page, but I'm not interested in all of it, only the quotes. Where are they located? If I hover over elements in the inspector, the matching parts are highlighted on the left, and you can see that every quote lives inside a div with the class "quote". So the plan is: first find all the quote blocks, then go through them one by one, and for each one grab the text (the quote itself) and the author, which sits in a small tag.

Let's find all the quotes, meaning all divs with the class "quote". I'll remove the st.code line and create a variable called quotes. On the content we just parsed, we call find_all, not find, to get every div with class_="quote". find_all means find every instance of a div with this class; note the underscore in class_, because class is a reserved word in Python and here we mean the CSS class. We look for divs with the class "quote" because every quote sits inside one of them.

Now that we have all of them, we go through each one with a for loop: for quote in quotes, that is, for every individual quote in the list we just found. Inside each block, the text, the quote itself, is in a span tag with the class "text", so text = quote.find, a span with class "text"; we use find here because there is a single instance inside each block. The next thing is the author.
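Roughly, the parsing and selection step looks like this; st.code(content) is just the temporary inspection step mentioned above:

    content = BeautifulSoup(res.content, 'html.parser')   # parsed HTML of the page
    st.code(content)                                       # temporary: dump the raw HTML to inspect it
    quotes = content.find_all('div', class_='quote')       # every quote block on the page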
The author, André Gide in this case, is inside a small tag with the class "author", easy. So let's create another variable, author, again with quote.find, but this time it's a small tag, not a span, with the class "author". The last one is the link; this is the part that should be appended to the base URL, and fortunately there is only one a tag inside each quote block, so we can simply target the a tag: link = quote.find("a"). That's it; we now have access to all three pieces.

Let's try printing them on the screen to see what we get. I'll use st.success, which gives a green background, for the text; a simple st.write for the author; and st.write again for the link. If I save this and go back, we have a few issues to deal with. The most important one: we're getting the whole tag, not just its content, and I only want the content. So I go back and add the .text attribute, which gives me the text inside the tags, and I do the same for the author. Save, go back, and that's solved: I have the quote and the author. For the link, though, I still have a whole a tag, and I only want the href part added to the base URL. Let me show you what I mean: if I copy the href value, paste it onto the end of the base URL, and press Enter, it takes me to a page with information about the author. So I need to grab whatever is inside the href attribute and attach it to the base URL.

How? On the link I don't want the whole a tag, I want the href, so I use square brackets with "href" in quotes: link["href"]. That gives me only the href, and if I go back you can see I now have just that path, which I should add to the base URL. So let's copy the base URL, use an f-string, and put the href inside curly braces right after it, without doubling the slash. Save, go back, and now we have the full link; much, much better.

Now that we have the full link, let's render it as a clickable link, so that when someone clicks on Jane Austen, for instance, they go to her page. For that I need st.markdown, because I'm going to add some HTML, an a tag. Inside an f-string I write an a tag whose href is this link, wrapped in single quotes this time, close the quote and the opening tag, put the author inside it as the link text, and close the a tag. That way, whenever someone clicks on the author's name it takes them to that page. Now that we've done that, there is one more thing we need.
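A sketch of the loop over each quote block, assuming the page structure described above; the unsafe_allow_html flag that appears here is explained right below:

    for quote in quotes:
        text = quote.find('span', class_='text').text        # the quote itself
        author = quote.find('small', class_='author').text    # the author's name
        href = quote.find('a')['href']                         # e.g. /author/Jane-Austen
        link = f'https://quotes.toscrape.com{href}'            # full link to the author page
        st.success(text)
        st.markdown(f"<a href='{link}'>{author}</a>", unsafe_allow_html=True)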
After these tags we need to let Streamlit know that it's okay to render this, because by default it doesn't allow raw HTML, so I set unsafe_allow_html to True. Let me also remove the st.write for the link, since we don't need it anymore. If I save this and go back, I see the author rendered as a link; if I open it in a new tab, there's Jane Austen's page, and it works perfectly. So the important part is done.

Now let's get to the CSV. To make it easier to collect the data, let's create an empty list called quote_file, and add the text, author, and link to it. Down below, inside the loop, I write quote_file.append and pass those three pieces of information as a list: [text, author, link]. The lists get added one after another, and they will become the rows and columns of the CSV file.

Next, let's create a button, outside the for loop, to generate the CSV: generate = st.button("Generate CSV"). If I save and go back, I now have this Generate CSV button. But I don't want the CSV to be written unless someone actually clicks it, so everything below goes inside a condition: if generate, do something. Let's also wrap it in try/except: try this, and if it fails, except and st.write something like "loading".

Inside the try block we create a DataFrame. A DataFrame in pandas is like a table, essentially the in-memory version of a CSV file, and to create one we use the pandas library we imported earlier: pd.DataFrame(quote_file), a data frame built from that list of text, author, and link rows (mind the indentation here). Then we turn the data frame into a CSV by calling df.to_csv and giving it a file name; I'll call it quotes.csv. Checking the folder, we don't have it yet.

We need a couple more things. First, get rid of the index, because otherwise an index column of 0, 1, 2, 3 would be written and I don't want that, so index=False. Then let's give it a header so we know what each column is: a "quote" column for the text, an "author" column, and a "link" column. So we have three columns and three header names, one for each. That should be it. If I save this, nothing appears yet, and there's still no file in the folder. Let me choose a topic; "love" returns too many, "life" as well, so I'll pick something shorter. All right, now let's click Generate CSV. Nothing seems to change on the page.
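Putting the CSV part together, a minimal sketch under the same assumptions (quote_file is declared before the loop and the append happens inside it, as described above):

    quote_file = []                                    # declared before the for loop
    # inside the loop shown earlier:
    #     quote_file.append([text, author, link])      # one row per quote

    generate = st.button('Generate CSV')               # button rendered below the quotes
    if generate:
        try:
            df = pd.DataFrame(quote_file)              # rows -> table
            df.to_csv('quotes.csv', index=False,       # drop the 0,1,2,... index column
                      header=['quote', 'author', 'link'])
        except Exception:
            st.write('loading...')                     # the video just prints a message on failure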
But if we go back to our directory, we now see a quotes.csv generated there. Awesome, that is so cool, but there are still two issues: the link column needs fixing, and some characters are the result of bad encoding, so let's fix both. Delete the file and look at what went wrong. Here, instead of appending the whole link object, let's append the href, because what we need is the href of the link. And for the encoding, let's add an encoding argument to to_csv; apparently utf-8 doesn't work well for me, so I'm going to use cp1252. If I save this there is still no CSV, so let's go back and generate one again. Now we have one; open it, and it's all good, beautiful. You can see the header with the quote, author, and link columns, the quotes here, and the authors and links there. Very easy.

That was it for today's video. If you liked it, please leave a like, subscribe, or comment. Thank you very much for watching and listening.
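In the sketches above, link already holds the full URL built from the href, so the only remaining change from this step is the encoding; something like:

    df.to_csv('quotes.csv', index=False,
              header=['quote', 'author', 'link'],
              encoding='cp1252')   # cp1252 worked on the author's machine; utf-8 is the usual default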
Info
Channel: Pythonology
Views: 7,457
Keywords: webscraping, pandas, streamlit, streamlit tutorial, beautifulsoup
Id: p2yNMfAWxOE
Length: 28min 18sec (1698 seconds)
Published: Thu Dec 30 2021