Building a news aggregator web app with Django by web scraping in Python

Captions
In this video I want to build a news aggregator web app. A news aggregator is a kind of system that takes news from a lot of sources and aggregates all of it. Good examples of news aggregators are Google News, Microsoft News, Bing News and Yahoo News, and there are several others, so you get the idea of what a news aggregator is.

Let me quickly explain the frameworks and technology that we are going to use in building this app. We will be using Django to build the web part of our app, and we will use some Python libraries to scrape the news websites. In this specific app we are going to scrape the websites of Times of India and Hindustan Times, and we will aggregate news from these two sources. So the plan is that we will first do some research on these websites and figure out which divs, which HTML tags, which classes and whatever else is useful for scraping news from them. Then we will write a script to scrape the news, after that we will set up the web page templates, views and everything else for a Django site, and finally we will integrate everything together.

So let's get started by installing the packages that we need. Let me clear this. Okay, the very first package that we need is bs4, so let me install this one. On my system it is already installed, so it said requirement already satisfied, but on your system it will take some time, depending on your internet speed, and it will install the package. The next one we need is requests, so let me install that one as well. That one is also already installed on my system, so it said requirement already satisfied, but again, on your system it will take some time and install it.

Okay, so this is the website of Times of India. I have done some research on it about which div and which tag the news comes in. There is certainly a huge amount of news on their website and a lot of information, but we don't want to scrape all of it; for the sake of this video we will scrape only a few items, the headlines. So I have taken the Briefs section of their website, and I found out that the headings come inside h2 tags. That is the information from the Times of India website, and if you want to write a script for it, it should look something like this.

Okay, so this is the script that we can use to scrape the news from Times of India. Let me quickly execute it and test it before heading over to other things. I will explain it first. What we are doing here is importing requests; requests is the library that allows us to make HTTP requests and get the content back. BeautifulSoup is the thing that allows us to parse the HTML content and data that we get from the request. You have already seen that all the news comes inside h2 tags, and that's why we are using them here. This line parses the entire content of the response just like an HTML5-supporting browser would. Let me quickly execute this code, and then I will explain why I have removed those few elements from the end.

Okay, before I do that, let us see what actually comes inside headings if you don't slice off the last three or four elements. What comes is these few extra things: trending topics, popular categories and so on. These other things end up in our headings because they also use h2 tags, so we have to remove them from our news; we don't want them to appear. So we will use this line, and what it does is take the first 13 elements only.
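The Times of India script described above is not captured in the captions, so here is a minimal sketch of what it might look like. The URL, the function names and the `limit` parameter are assumptions for illustration, and the sketch uses Python's built-in `html.parser` backend rather than whichever parser the video used.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL for the Times of India "Briefs" section; the video never reads it out.
TOI_BRIEFS_URL = "https://timesofindia.indiatimes.com/briefs"


def parse_toi_headlines(html, limit=13):
    """Return the text of the first `limit` <h2> tags found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    headings = soup.find_all("h2")
    # Trailing <h2> tags hold page furniture such as "Trending Topics",
    # so only the leading ones are kept.
    return [h.get_text(strip=True) for h in headings[:limit]]


def fetch_toi_headlines(url=TOI_BRIEFS_URL, limit=13):
    """Download the page and extract its headlines."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return parse_toi_headlines(response.content, limit)
```

Keeping the parsing separate from the download means the slicing logic can be tried on a saved HTML snippet without hitting the site. The page layout may well have changed since the video was recorded, so the tag name and the cut-off of 13 should be re-checked against the live page.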
Okay, that is done. Let us now look at our headings. That looks good, because at the end nothing is coming other than the expected news. Let us try to print that. Okay, that looks good; this is how our news will come. So that's where we are done with Times of India.

Now let's head over to scraping the website of Hindustan Times. I also did this research off camera. We are using the India News section of Hindustan Times, and there I found out that the news comes inside a div which has the heading four class; all the news comes there. So we will use this information to parse the news, and here is how our script will look. Okay, let us print what we have got. Oops. Okay, that should print our news. There is actually a lot of space between the text, which is why it is coming out like this, but we will take care of that before putting it into our web page. We do get the news from the web page there.

So up to here we are done with the first part, where we planned to scrape the websites of the news channels. In the next part we will set up our Django server, build the templates, views and everything, and then integrate everything together.

So let's get started with setting up the Django server. Let me quickly clear this whole thing. Before we start using Django we need to have Django on our system, and the way we install Django is this. Since it is already installed on my system it says requirement already satisfied, but when you run this command it will take some time to install on your system, and then you will be good to go to the next step. In Django there is a convention that we create a project, and inside the project there are several apps, each of which serves a purpose. So here we have to create a project first.
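The extra whitespace seen in the Hindustan Times output earlier can be cleaned up in plain Python before the text reaches a web page. The video does not show how this was done, so the helper names below are hypothetical and this is only one possible approach.

```python
def clean_headline(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops
    # leading and trailing whitespace, so joining with " " normalizes it.
    return " ".join(text.split())


def clean_headlines(raw_texts):
    """Clean every headline and drop entries that were only whitespace."""
    cleaned = (clean_headline(text) for text in raw_texts)
    return [text for text in cleaned if text]
```

For example, `clean_headline("\n   PM  visits\n\t Delhi  ")` gives `"PM visits Delhi"`.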
Let's call our project the HackersFriend news aggregator. This is the command that we use in Django to create a project, so that should create our project, HackersFriend news aggregator. Okay, that is created, and now I am inside the HackersFriend news aggregator project directory, because when I executed this command it created a directory with the name of my project and then I moved into it. As I said, we need to have apps inside our project to work with, so we will call our app news. This is the command that we have to use to create a new app, and I am calling our app news, so this should create the news app inside this project. Let us see. Okay, the news app has been created. Let us move over there and see what is inside it. We have some files: models.py, views.py, tests.py and so on. From here we are good to start.

On the Django side, the very first thing that we need to do is include this newly created news app in the installed apps, and we do that in the settings.py file. I have opened that folder in VS Code, and this is how it looks. The directory structure is something like this: this is the parent directory, inside it there is a directory with the name of the project, HackersFriend news aggregator, and there is a news directory, which is the directory of the app we just created. So the first thing we need to do is include the news app in the installed apps. Let us do that. Okay, this is installed apps, and we will include the news app over here. If you are going to put this into production, remember to turn this debug setting off: you need to set it to false, so that debug information about your website is not displayed to users when an error occurs. For the sake of this video I will keep it set to true, because I want to see if something goes wrong. That is all we have to do in the settings.py file.

Now we have to create a template. In Django we have a convention of putting all the templates inside a templates directory. You can customize where you want to put your templates, but I will leave it at the default. So we will create a templates directory inside the news directory, and inside templates we have to create a directory with our app name. Let me create a new directory; it will have the name news, and inside that we will create the first template, index.html. Here we are supposed to write the HTML code and some Django template language code.

Let me explain what we want to do here. This is what should go inside the templates directory. I will explain each and every line of it, so don't worry; I just want to save time so that you don't have to watch me writing code that you already know. It is very simple. We have simply created a normal HTML page with the head, doctype and everything, and let me give it a title. This is the Bootstrap CDN that we have included for the CSS, and this is the jumbotron class that comes with Bootstrap. Inside that jumbotron we have given the name HackersFriend news aggregator, which is the name of our app, and we have given a button to refresh the news. When a user clicks on this button it will reload the home page. This is the container, and inside the container we have created a row, and the row is divided into two parts. In Bootstrap we have a convention of dividing the entire container into twelve parts, and I just want two parts: the left side will show news from Times of India and the right side will show news from Hindustan Times. This is a loop, which is a Django template specific thing. This TOI news will be a variable that we will pass from our views.py file, and this is the variable in which we will pass all the news that we fetch from Hindustan Times. All of that will come here, and then we will loop over that variable and print the news that we have got. This is fairly simple. There is a detailed tutorial on Django in which we have explained how all of this is done and how these things work, but for this video I will keep it simple. And this is the jQuery CDN file and everything else that Bootstrap requires to work. So that is our index.html template.

After that we need a view from which we can serve this template. In the views.py file we have to write a view, so I will write that view. The location of our template is inside news, and index.html is the name. Okay, so that is the view. Next we need to include this view in the URLs from which Django serves everything. We could create a new urls.py file for this specific app and then include that in the main urls.py file, but this is a fairly simple app and we don't have too many apps, so I will keep things simple and include the view directly from here. So, from news, okay, we have imported the views of the news app, and now we have access to them here. I want to serve the whole thing on the home page, so I will not give any URL pattern here. I will give this URL the name home; that is not necessary, we just keep it for simplicity. So that's it from here.

Now we have to integrate the scraping part, the scripts which we wrote for scraping those websites, into the views.py file. So let's head over there. We have the scripts that we created, and they are here: we created this script for scraping the news from Hindustan Times, and this script for scraping the news from Times of India. Let's put both of them into our views.py file. This will come over here. Since we are going to make requests to both Hindustan Times and Times of India, for the sake of variable naming I will name this one the Times of India request and this one the Hindustan Times request, and this will also change. We need to do this because if we use the same variable name for both Times of India and Hindustan Times, it will start giving errors, or it will just overwrite the previous values with the new values. To avoid that, we are keeping the two separate.

Okay, I just went off camera and completed the code for this, because it was going to take a lot of time. What I have done is take all the values inside TOI headings. Since these headings contain a lot of information which we don't want, I have created a separate empty variable, TOI news, iterated over all the headings, taken only the text part of each, and appended that to TOI news. I have done the same thing with Hindustan Times as well: I have taken all the HTML content into this HT soup, and then we have found all the divs with the heading four class that we already talked about. From the headings we have to remove the first two values, because they were coming with Select City and some other things that we did not want; you will see that after running this inside a for loop. So I have removed those, and here again I have created a new separate array and appended the text part of every heading to it. Now we need to pass all those things to our template so that it can serve them, so let's do that. Okay, that's done; we have passed that to our template. Now I think it should be working. Let us try to run it.

It takes some time to load the first time. Okay, it is performing system checks. Don't worry about these migrations; those are for the admin tables which Django needs to create in order to save admin data and so on. We are not going to take care of that for now, because we are not going to have any admin side; the app just scrapes the news from the websites and displays it to the viewer. So let's head over to our browser. There we are: that is the HackersFriend news aggregator which we just built. If you click Refresh news, it will reload the news from the websites. The news does not change that frequently, so it does not look like anything is changing, but if you do that after a few hours it will display different news.

So this is the news aggregator that we have just built using Django and some Python libraries. The entire code and a step-by-step guide are available at HackersFriend, and there is also a git repository; a link to that is available in the article. So that was how we can build a news aggregator website. I know this is a fairly simple and not that good-looking website, but you get the idea of how we can fetch the news from websites and display it in our web app. So that's it.
Info
Channel: HackersFriend
Views: 13,050
Rating: 4.9365077 out of 5
Keywords: web scrapping, python, django, news aggregator, django project, web scraping project
Id: gvdSkBmjpbY
Length: 28min 4sec (1684 seconds)
Published: Sat Jan 11 2020