Python, NoSQL & FastAPI Tutorial: Web Scraping on a Schedule

Captions
I was recently reading an article from Discord where they were storing 120 million messages a day with only four back-end engineers. What's more, that was in 2017; now, in 2021, that number is probably a billion, if not tens of billions, of messages a day, still with a really minimal staff. So the question is: how do they do it? The answer is a NoSQL database, an open source NoSQL database called Cassandra.

In this one I'm going to show you how to integrate the Python web framework FastAPI with Astra DB, the managed database service for Cassandra provided by DataStax (they sponsored this video). The idea is to build something practical. I could tell you all about the project itself, but let's look at why I think it's worth your time by seeing actual code and how quickly it can adapt to the changes we might need.

First, a quick demo. This is one of the API responses, and the data behind it is stored in Astra DB: we can bring it back and look it up any time. It's not much data at this point, but the cool thing is we can see every scraping event that has occurred. We'll do this once with a pre-existing data item and then add a new one.

Jumping into the Jupyter notebook, I'm going to run this example with that existing data. What we'll eventually implement is our own scraping client, plus methods to scrape a dataset from, in our case, amazon.com (I'll show you where we're scraping that data in a moment). It takes a moment to run because we implemented an endless-scroll feature: it keeps scrolling until the page won't scroll any more, so the idea is to grab all of the data we possibly can and put it into a dictionary. That's a lot of data, and after I have it all, I validate it with Pydantic to keep only the data I'm interested in for my Astra DB Cassandra models, then add it in and store it like that. If I refresh the endpoint, notice it's all stored and I've got another event. Pretty simple.

Here's the part I'm really excited about. Look at this dataset: it's the raw data from amazon.com, and it has a bunch of fields I might want to keep or store at some point, for instance this brand name. Our current data doesn't have the brand anywhere, so I want to add it. That means jumping into my Cassandra models: in the product model, alongside title, I'll add a new column for brand, and I'll also put brand into the scrape event model; when it's empty like this, I don't have to supply a value. I also want to make sure the schema I end up using has this field. Of course we'll go over all of this once you get in there, but the idea is: we scrape the raw data into a dataset dictionary like we just saw, validate it to ensure it's the data we want, and then add it to our database.

Let me give this a shot. I'll restart the kernel (Restart and Clear Output); again, this is an existing product that I've already scraped. Run it, give it a moment, and it finished. Here's our dataset; scroll down a bit and here's our validated data, now with the new brand field; and here it is in the database. I refresh the endpoint on my FastAPI app, and again this is coming directly from Astra DB: there's the brand. Some of these scrape events don't display the brand yet, but that's also fairly easy to fix: go back into the schema and add it where appropriate. Now when we refresh, the response includes the brand whenever it's present; of course, since I only just added the field, none of my old values show it.

If you're coming from Django or a SQL database behind FastAPI, I just did something that should be mind-blowing: I added a new field and barely touched anything. Here's the code that makes that happen. In main.py, where I have my FastAPI application, there's a startup step that syncs all of the database tables coming from these models; it's really that simple. And when I added the new column, FastAPI knew to restart the application itself; none of that happened inside the Jupyter notebook. If FastAPI hadn't been running, I would have had to sync the table from the notebook session instead, much like we'll see later. To me that is just so cool and so easy. It's also incredibly fast, both in terms of writing and retrieving this data and in terms of adding the new column; it's no big deal.

Now let's grab a new product, one I definitely don't have yet. Looking at the products endpoint, there's not a whole lot in here, and a lot of them are test products. So let's go to Amazon, check today's deals, and find an actual physical product. Here's one that might be worth monitoring because it's a deal for today, so maybe the price goes down or maybe it goes back up. Scrolling down the product page, this is the data I'm looking to extract, and here is the ASIN. I'll go back into my demo notebook, replace the ASIN, do Kernel, Restart and Run All, and pause the video until it's done.

All right, it finished. In this case it does not have a price: the price isn't showing up the way my scraper is currently written, which is something you'll learn about when we actually build out the scraping. But all of the other data is here, including the validated data and the scrape event. Now if I go into my FastAPI application and look up that ASIN, I'll see all of that data as well, and if I ran it again it would keep adding to what's in here. It's not grabbing the price string because of how I have it set up, but it is grabbing all of the data and the things that I might want.
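To make that concrete, here is a minimal sketch of the kind of startup hook described for main.py; the module layout and model names (db, models.Product, models.ProductScrapeEvent) are assumptions based on what gets built later in the series, not the author's exact file:

```python
# main.py -- rough sketch, assuming the module layout built later in the series
from fastapi import FastAPI
from cassandra.cqlengine.management import sync_table

from . import db, models  # assumed modules: db.py and models.py inside app/

app = FastAPI()


@app.on_event("startup")
def on_startup():
    # connect to Astra DB and register the default cqlengine connection
    db.get_session()
    # create or update the tables to match the current model definitions,
    # which is why adding a column like "brand" needs no migration step
    sync_table(models.Product)
    sync_table(models.ProductScrapeEvent)
```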
Another field might be country of origin; perhaps I want to add that too. Let's first verify in my dataset that I'm even getting it: scrolling down a bit, there it is, country of origin. I did turn the field name into a slug, and Python doesn't love dashes in field names, so I'll make a quick on-the-fly change. Jumping into the scraper and looking at the method that slugifies the dataset keys, all I'm going to do is replace dashes with underscores in the key. Quick, easy change; restart and run the kernel, and after a few seconds the field comes back as country_of_origin, which is a little closer to what I want.

Now, into my models yet again: I'll add country_of_origin to both the scrape event model and the product model, again just a text column. We could add a default here or just leave it as text; a default only applies to future values. Save that, and then in the schema, country_of_origin becomes an optional string in both the scrape event schema and the list schema. I save that, and the FastAPI application should have updated itself; it looks like it didn't. It had an error around a Union annotation: I used an equals sign where it should be a colon. Change those two, no big deal, and now the tables should be synced. I didn't do anything else other than save everything and change the fields.

Going back into my demo notebook, I restart and run again. While that's running, I can absolutely look at the product itself and see that country_of_origin is still missing on this particular model; this is the main product model, and these are all scrape events. Once the demo finishes and I refresh, I will absolutely see country_of_origin updated. While that's happening, let's also look at the CRUD module, because what I think is cool is that this is it: if you come from the SQL world, you probably wouldn't call create every single time, but with Cassandra and Astra DB you can call create and it will update against the primary key, just like that. And now we have the new country_of_origin.

Of course, I can run this as many times as I want, or try to find another product worth tracking. Let's say this camera. I'm hoping I'll get the price this time, but it's certainly possible I might not; it's kind of hard to even diagnose where the price is on this page. I'll grab the ASIN from the URL this time (it's right next to that identifier, and you can always just search for the same number), and again, if the price doesn't come in, it's no big deal; you'll learn how to modify the scraper to find it. While that runs, if I look up the value it's not found yet. I don't have a handler on the server for a missing lookup, which isn't that big a deal, but once the run completes it should be in there. And what do you know, it actually did get the price this time, which is cool. We refresh, and there we go: I now have an event that actually did scrape this product.
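As a hedged sketch of that one-line key normalization: only the dash-to-underscore replacement comes from the video; the surrounding function and the use of python-slugify are assumptions for illustration.

```python
# Assumed shape of the dataset-key cleanup in the scraper
from slugify import slugify  # assumption: python-slugify or similar


def normalize_key(raw_key: str) -> str:
    key = slugify(raw_key)        # e.g. "Country of Origin" -> "country-of-origin"
    key = key.replace("-", "_")   # Python-friendly: "country_of_origin"
    return key
```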
So again, I can Restart and Run All as many times as I want. In fact, one of the other key parts of this project is scheduling these things out: running periodic tasks against the current products we have, scraping them with much the same methods we've been using here, which is really, really nice. Something I won't demo, but which is also really cool, is that we can add this data through an API endpoint as well: the raw scraped dataset, the data coming through right here, could be sent to an endpoint and it will clean out just the fields it needs from that bigger dataset item. I really, really like using all of these pieces together, and I hope you will too. Let's jump in and get started.

In this series we are going to use Python to perform web scraping. If you don't know what that means, it just means we'll write some code that opens up a web page and extracts the specific data we're looking for. We're also going to do this on a schedule, which means our data is going to get big fast, so we're going to use a Cassandra NoSQL database to handle it, because it can manage massive amounts of data. If you're scraping on a schedule of, say, every 10 seconds, you're going to accumulate a lot of data really quickly, and Cassandra can absolutely handle it. Instead of spinning up and managing our own database server, we'll be using DataStax Astra DB. DataStax are major contributors to the open source Cassandra project, and Astra DB is really just their managed service for it; luckily for us, it gives us up to 80 gigabytes for free, which is amazing. I'm going to show you how to implement all of these things, but before you do, go ahead and sign up for an account. They did sponsor this video, but the idea here is that we need to learn how to practically implement all of this.

To do this we'll be using a number of tools. If you want to follow along with me, go to CFE's GitHub and look for the "Scrape Websites with Python" repository; if you scroll down, you'll see all of the tools we're going to work through. This is a massive series with a lot of really practical value, so hopefully you'll get into it. Before you do, I recommend some experience with Python; 30 Days of Python is my recommendation, at least up to day 12 or 15. The idea is: do you know how to write classes and functions, and do you know what strings are? If so, you are good to go and good to start with this one. I'll also be using Visual Studio Code, so download that as your code editor (you can use another one if you're really comfortable with it), and Python 3.9; you want at least 3.6, but 3.9 is what I'll be using. All the links for everything are in the description below, and feel free to jump around to any section, because depending on where you are, it's a good idea to jump around.

Now let's set up the project in VS Code. Be sure to go to the repo, which is of course linked in the description, open VS Code, and create a folder where you want to store this. I put mine in my own user's dev folder, and you can call it whatever you want, something like fastapi-nosql-scrape. Open that folder and run git clone with that repo link and a trailing period. This clones the latest version of the code, but the whole series is broken up into chunks: each section has its own branch. We're in the third video right now, so I'm going to do git checkout 3-start, which gives me that section's starting point, including requirements.txt, and that's what we'll work from going forward. In the future you'll also be able to check out 3-end, or whatever section you're on; if it's 10-end, you'll see the code at the end of that video. I actually already have this project on my system, so I'll be using it in this folder.

Next, we create our virtual environment with the Python 3 version we have installed, which in my case is 3.9.7; the key is just Python 3.9. Run python3.9 -m venv . to create the virtual environment in the current directory. In my case I'm going to update my .gitignore to ignore bin, include, and lib, just to make sure those are in there, and since I'm on a Mac I'll also ignore .DS_Store. Then activate the virtual environment with source bin/activate on macOS and Linux; on Windows it's simply .\Scripts\activate. The reason we're using a virtual environment is to isolate the package versions for this project from the rest of your system; you just want to keep those versions local to your project. By all means, if you're using pipenv, Poetry, or Anaconda, go ahead; a lot of those work, but the built-in virtual environment is what I recommend at this point.

Now run pip install -r requirements.txt, which should install everything we'll need; that list is all right here. One thing that might come up as a problem for you is Selenium. In the setup folder of the repo there are a number of guides for getting your system set up, and I'm not going to walk through them because every system is so different. In my case, on macOS, I would just use brew to install chromedriver; we need chromedriver for Selenium to work with our system. The other piece is Redis, and this same guide will get you through that. If you're not familiar with Markdown files, by all means read the guide on the repo itself; the repo will always have the most up-to-date code, and since you just cloned it, there's a good chance you already have it. Hopefully at this point you've also made sure git is installed; there are a number of tools you need before all of this becomes incredibly useful. Anyway, at this point I'm going to assume you have all the requirements installed and all the setup you need, and of course we'll come back to the setup if we need to in the future.
Worst case, for something like Redis, you could always install it on a virtual machine. Windows users will have a slightly harder time getting Redis going; it's not that hard, but it's not always fun to set up. If you do have questions, let us know in the comments; otherwise, let's keep going.

Now we're going to start the process of integrating Python with Cassandra using the Astra DB service. Definitely go to the link and sign up on DataStax for your free account; again, it's 80 gigabytes a month free, which is really, really good. Assuming you've signed up, jump into the console and create your first database. In this case I'll just call it fastapi-db and give it a keyspace name of scraper_app. You could always add another keyspace like scraper_app_test, but I'm going to keep it simple and use scraper_app. The keyspace itself is what holds the tables: tables don't live directly in the database per se, they live in a keyspace, which lives in the database. You can have many different keyspaces in one single database, so if you ever need to add more, you can; if you need to change your database itself, you can't, you just create a brand new one, which is also incredibly easy. So we've got our keyspace name, and we can always add more later.

Next we need to choose a provider and region. In my opinion, just pick one that's close to you physically, or the provider you like with a region close to you; if you're using Google Cloud, for example, pick a nearby region. I'm in Texas, so one of the east regions works just fine. If you're going to put this behind a web application in production, like we eventually will, you'll probably want the web application close to whatever region you end up using, and all of it close to your users. There are a lot of caveats to what I just said, but keep in mind you can always change this later or add a new one later; we're just learning right now. So let's create the database. This won't take long, but it does have to provision.

While it's provisioning, I'm going to add a new token for my application. Jumping back into my project, I'll create a file called .env for managing our environment variables. If you're familiar with git, this isn't tracked by default; my .gitignore should already have a .env entry, as it does, and that's where I want to add these credentials. I definitely do not want to share them, because they grant admin access to my entire DataStax Astra DB configuration and my organization. So in the console, go into your current organization, open Organization Settings, go down to Token Management, select the Admin User role, and generate a token. These tokens are what we want to keep in environment variables; you can easily delete a token later (you just come in here and hit delete), so if you ever have to, it's not that big of a deal.
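The .env file will end up looking roughly like this; the variable names follow what gets configured next, and the values are of course placeholders:

```
# .env -- never commit this file
ASTRA_DB_CLIENT_ID=your-client-id-here
ASTRA_DB_CLIENT_SECRET=your-client-secret-here
ASTRA_DB_APP_TOKEN=your-app-token-here
```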
So we copy the Client ID and save it in .env as ASTRA_DB_CLIENT_ID (no quotes necessary), then ASTRA_DB_CLIENT_SECRET for the client secret. The token itself we won't really need to use, but we might as well bring it in too as ASTRA_DB_APP_TOKEN; the token is essentially a combination of the client ID and part of the client secret, so the client ID is definitely in there as well.

Now let's check whether our database has finished provisioning; in other words, has the fastapi-db been created? It looks like it has: the status is Active. So now what we want to do is connect to it. Under Connect, we'll be using the Python driver, cassandra-driver, and we need to download the secure connect bundle. Just to verify: cassandra-driver was already in our requirements.txt, so if you didn't install it, be sure to install it now. I'm not actually going to use the driver just yet; we'll do that in the next part. The idea here is just to configure the environment variables and the location for this bundle. So I'll download the bundle now; it's named after your database, something like secure-connect-fast-api-db.zip. I want to bring it into my project, but we do not want this bundle in git at all in its current state (I'll show you a better way to handle it later, but for now I'm not going to do that). So I'll create my folder called app, and inside it another folder called ignored, and the reason for that is to go into our .gitignore and simply add app/ignored. We'll change this later, but that's how I want to leave it for now. With that, I bring the bundle over and rename it to simply connect.zip. You do not need to unzip it; you can just leave it like that. This is the baseline of what we need.

Next we're going to configure the Cassandra driver. Back in the database console, under Connect and then Python, you can see all of the steps to configure the driver; the key part is the connection snippet, so copy that, create db.py inside our app folder, and paste it in. I think it's fairly obvious how we're going to solve this. The first thing it needs is the bundle itself; that was our connect.zip, and I need the actual path to it. The reason I'm setting it up this way is so that I can turn this into a FastAPI app later; that's kind of the key here. What I want to do is split the snippet into two different functions: first get_cluster, which returns the cluster, and then get_session, which uses get_cluster and returns the session. There might be more things we'll have to add here later, but for now these are the methods I want. Let's go ahead and implement this so we can see it working; I'm also going to add in the example row query like the docs did, and I'll call session = get_session() at the bottom, just to verify everything is configured correctly.

Of course, if I run python app/db.py now, what I get is probably an error; it hangs for a little while because it's trying to connect, and then it doesn't work. Notice that the bundle is connect.zip; we never had to unzip it, so I can actually put the path in as a string like "ignored/connect.zip", and that should work if I spell it correctly. But that's not great; I don't like putting paths in this way and would rather use pathlib. So: import pathlib, and the first thing is to set a base directory, which is the app directory itself, pathlib.Path(__file__).parent (that file is db.py, and its parent folder is app), and then build my cluster bundle path as the base directory slash "ignored" slash "connect.zip", turning the whole thing into a string. Maybe at some point the driver will accept a pathlib path directly, but right now an absolute path string is the best way to do it. Save that and run again: this time it shouldn't say not found, it should actually find the bundle, but now we get token errors.

So how do we supply the client ID and client secret? If you don't follow best practices you could just copy and paste them in here, but we do follow best practices, so: import os, and from dotenv import load_dotenv (that's the python-dotenv package I mentioned), then call load_dotenv() to load up our environment variables; cassandra-driver is the other portion of this. With that, I can read what's in my .env file: copy each key and set it with os.environ.get for that key, then the next key. Notice I do not need the app token; it's just looking for the client ID and client secret, which I have, and I can pass those in here and here. Save that and run again; this time we shouldn't see major errors, and if we don't, we've configured it correctly. What do you know, there it is. It's giving us the release version and it is fully connected. There are things I'll want to implement later when I actually bring this into FastAPI, but for now this is good; this is all I'll really need to start modeling our data.
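Putting that together, a minimal sketch of app/db.py at this stage might look like the following. The env var names and folder layout follow this section; the exact snippet Astra generates for you may differ slightly.

```python
# app/db.py -- hedged sketch of the connection setup described above
import os
import pathlib

from dotenv import load_dotenv                 # python-dotenv
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

load_dotenv()                                  # pulls the ASTRA_DB_* values out of .env

BASE_DIR = pathlib.Path(__file__).parent       # .../app
CLUSTER_BUNDLE = str(BASE_DIR / "ignored" / "connect.zip")

ASTRA_DB_CLIENT_ID = os.environ.get("ASTRA_DB_CLIENT_ID")
ASTRA_DB_CLIENT_SECRET = os.environ.get("ASTRA_DB_CLIENT_SECRET")


def get_cluster():
    cloud_config = {"secure_connect_bundle": CLUSTER_BUNDLE}
    auth_provider = PlainTextAuthProvider(ASTRA_DB_CLIENT_ID, ASTRA_DB_CLIENT_SECRET)
    return Cluster(cloud=cloud_config, auth_provider=auth_provider)


def get_session():
    cluster = get_cluster()
    return cluster.connect()


# temporary sanity check from the Astra connect docs; removed again in the CRUD step
session = get_session()
row = session.execute("select release_version from system.local").one()
print(row[0] if row else "connection problem")
```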
Now that we can actually connect to a session, it's time to start storing data in our Cassandra database. Inside our app folder we'll create models.py, which will hold all of the classes we want to turn into tables and the columns within those tables. Before we get there, let's think about what it is I'm trying to do. Say data is equal to something I want to store: in the case of Amazon, what we'll be scraping, we'll have an ASIN, which is really the Amazon ID number, and maybe a product title, so something like a title of "mark one". That's the data I want to store. Granted, in Python right now it's just a dictionary, and if I had a million of these it wouldn't be efficient to keep them in Python, but it certainly would be efficient in Cassandra.

So we want to turn this into a class. The intuition would be: take a class called Product, give it an asin field (I notice it's a string, so give it a type of str) and a title, also a str. That's getting closer to being a Python dataclass. In this case I'd actually give title a default of simply None, but the ASIN, that Amazon ID number, is something we want to use to look up this particular product later. In other words, let's jump into amazon.com right now and go to any product; it doesn't really matter which one you end up using. Let's look at this one; I don't know what it is, it's a generator. At the very top of the page URL is that number; every single Amazon product page has it, and if you scroll down a bit you'll eventually see some data about the product (this is true on all products, though the layout might differ a little), and you see the ASIN right there. That is for sure the unique identifier for this product; you can definitely do your own research on it, but the idea here is that this is how we're going to look up this product in our database. We'll do all of that in a little bit, but this is the key one; this is the primary key (yeah, you see what I did there, especially if you're coming from SQL).

So the idea is that we want to declare asin as the primary key. How do we go about doing that? The title itself is not a primary key: we're not going to be doing lookups based on the title, especially with a string, since lookups on an arbitrary string aren't nearly as efficient as lookups on either an indexed value or a primary key. Again, we'll go through these things as we go a little further. For now, let's implement this by creating our first model. We'll do from cassandra.cqlengine import columns, and from cassandra.cqlengine.models import Model. Declaring the class as a Model turns it into a table, and tables actually exist in keyspaces, so what is our keyspace? That's a big question. When we connect to our cluster, that connects us to our database (the secure connect bundle is doing that for us), but what it's not doing is picking the keyspace; we have to specify the keyspace ourselves, and that of course is the one inside our Astra project. Scroll to the bottom of the database view and you've got your keyspaces (you could always add a new one if you just want to test things out), and there's ours. So these two fields are going to be columns in our table. The first one, the ASIN (the Amazon Standard Identification Number, I think it's called) is simply columns.Text, and in there we have options to declare primary_key=True, and we can say required=True as well. The same concept applies to the title column, except the title is certainly not a primary key, and it may or may not be required; I'm just going to leave it as not required, and we'll see what happens in a little bit when we actually start using it.
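As a minimal sketch of what app/models.py looks like at this point (the keyspace name follows the one created earlier; treat this as illustrative rather than the exact file):

```python
# app/models.py -- first pass: one table, the ASIN as the primary key
from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model


class Product(Model):
    __keyspace__ = "scraper_app"                           # keyspace created on Astra DB
    asin = columns.Text(primary_key=True, required=True)   # the lookup key
    title = columns.Text()                                  # not required, not a key
```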
This is a very basic model based on that data, and I'm actually going to leave it just like this. One of the advantages of using Cassandra is how easy it is to change things later; the things we might not want to change, and probably won't change, are the primary key, the way we're going to end up looking things up. What this translates to is Product.objects.filter(asin=<some Amazon ID number>), and that's essentially one of the ways we're going to look things up. What you're not going to do is filter(title="mark one"); that's kind of the key difference between these two fields. Now, if we did turn title into an index, then you would be able to filter that way, but I don't actually recommend it; for speed, I recommend using an actual identification number, and we'll get back to that a little later. So: definitely the ASIN lookup, not the title lookup. Cool, that's our baseline model. Let's actually start storing data in it.

So I want to start working with this Product model, but before I do, I have to add a new item in my get_session call, related to setting the default connection and registering the connection; this is so our model will work correctly. I'll import register_connection and set_default_connection, register the connection with the string of the session, passing the session itself as the session argument, and then call set_default_connection with that same string. It's just a quick and easy way to ensure the Product model will work.

Now that we've got that, let's open up crud.py and make that module. The crud module is going to let me run the various commands that relate to creating, retrieving, updating, and deleting; initially we'll just do create so we can see how it works. So: from .models import Product, from .db import get_session, declare our session with session = get_session(), and then use our Product model. To do that, we'll define create_entry, which takes a data argument as a dictionary, calls Product.create(**data), unpacking that data, and returns whatever that value is, which will be an instance of the model. That's it. Save it, open a Python shell, and do from app.crud import create_entry; it takes a moment because it boots up the session. In this case I get a KeyError: 0, and that's because db.py is still executing that test get_session call at the bottom of the file when it's imported. So remove that test code, save, and try the import again; this time it works.
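In code, the two additions described here might look roughly like this: the get_session change in app/db.py and the first helper in app/crud.py (a sketch assuming the module layout above).

```python
# app/db.py -- get_session updated to register the default cqlengine connection
from cassandra.cqlengine import connection


def get_session():
    cluster = get_cluster()
    session = cluster.connect()
    connection.register_connection(str(session), session=session)
    connection.set_default_connection(str(session))
    return session
```

```python
# app/crud.py -- create-only for now
from .db import get_session
from .models import Product

session = get_session()


def create_entry(data: dict):
    return Product.create(**data)   # returns the created Product instance
```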
The data I want is very similar to what we had before, so we might as well copy and paste that dictionary. Now I try to create it, and we get: table scraper_app.product does not exist. There's a reason for this, and we'll get to it, but what it shows us is that the default table name for this Product model is simply product, the lowercase of the class name, which we'll see throughout. In order to use that table we have to call something called sync_table: from cassandra.cqlengine.management import sync_table, and then sync the table for Product. Exit, go back into Python, and press up a couple of times to grab that import. Oops, it should be cqlengine, not csqlengine; try again, and now it imports. We get a warning about schema management that we can deal with by setting the environment variable it mentions; seeing that warning is actually a good sign, since it shows everything is working, it's just related to the package itself, and with the environment variable set, it's gone.

Going back to it, we should have no issues creating an entry now, so copy that data again and call create_entry(data). There it is. And you know what, let's do it again, and again, and again. If you're coming from the world of SQL, you might expect a particular result here, so let's look: Product.objects.all(). That's a query set, but I actually didn't import Product, so import it and try again. This gives you a ModelQuerySet, which is really nice: something we can iterate through, or just wrap in list(). Wait, it only has one row? That's odd; I called create several times. What if I change the ASIN slightly and call it again? Now it has two. What if I change the title and call it again? The title changed, and there are still only two rows. How cool is that?

This is maybe expected or unexpected behavior depending on your background; in my opinion, it's great. It means I can update an entry with that same create call. That's exactly what's happening, because it goes off the primary key: when I called create multiple times, I was using the exact same primary key every time, and with the same primary key it updates the other fields instead of adding a new row. Now, this isn't perfectly accurate for every possible case, but the idea is: this is how I'd store the product itself, any sort of detail about the product, and storing it is not going to be a regular occurrence. What we want to look at next is the same concept but storing multiple entries for the same ASIN: a price lookup, an actual scrape event. So let's go ahead and do that.
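As a quick, hedged illustration of that upsert behavior (the ASIN and titles here are just example values):

```python
# Cassandra writes are upserts: creating twice with the same primary key
# updates the existing row rather than adding a second one.
from cassandra.cqlengine.management import sync_table
from app import crud, models

sync_table(models.Product)                                   # create the table if needed

crud.create_entry({"asin": "test123", "title": "mark one"})
crud.create_entry({"asin": "test123", "title": "mark two"})  # same ASIN: title is updated
print(list(models.Product.objects.all()))                    # still one row for test123
```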
Now let's implement one of my favorite things about using Cassandra. We look at this model and say: hey, I actually want a new column in here. So I add price_string = columns.Text(), and there you go, you now have a new column, and crud will work just as it did, which I think is pretty cool. Of course, if we want to add attributes to that column, we can go into the cqlengine documentation and look for, say, a default: you can pass in a default as either a value or a callable. Let's say the default is the string "-1". Now let's try this out: open Python, from app import crud, models (this should take a minute), and then list out models.Product.objects.all(). What do you know, price_string is in there, listed as None for the existing rows; that's important. Now let's use crud again with some new data. Actually, I'm going to change the ID here to an ASIN of "testing123", that's important, and then crud.create_entry (I believe that's what we called it) with that data, and now the new row has the default value. The default doesn't apply to all of the old rows (there are ways to do that, to apply a default to them or, in other words, update them), but it will apply to all of the future ones. And what if I change the default to, say, "-100"? Exit out of here, bring the imports back in (from app import crud, models), build the data, and crud.create_entry(data). There we go, our new default. Really, really nice. Part of the reason we can add a column willy-nilly like this is how Cassandra works, but it's also because we're using sync_table: I don't have to run migrations or make a bunch of changes to the table itself; I can just change things on the fly like this. That's not true for every kind of column, but for columns like this, where I'm just adding additional context, I totally can. It's pretty sweet.

Next we want to track an actual scrape event. Very similar to create_entry, we're going to want something like create_scrape_entry, which essentially always adds a new item to our database. Because if you think about it, the current Product model updates based on the ASIN: if the ASIN exists in the database, it just updates all the other fields, and if it doesn't exist, it adds it. That's not what we want for events; we want to always add a new row by default, and that's what this crud function will do. To do that, we need to create another table, and this one is defined a little differently. We're going to call it ProductScrapeEvent, and instead of using our ASIN as the key, we pass in a different field: a uuid field, columns.UUID, and that will now be our primary key. I could totally leave it like this and it would work the same way as the Product model; as long as I have that UUID, I'd be able to update that row in exactly the same way. But a UUID isn't something you're realistically going to keep track of.
The ASIN is a lot easier to keep track of, and we have a reference to it from the actual place we're scraping; there's an ASIN right there on the page, whereas the UUID is not something you'll know off the top of your head or really keep track of. We still want to be able to look up scrape events by ASIN, doing the exact same lookup as before, and that's on purpose. So how do we go about doing that? By adding index=True to the asin column; it's really just that simple. Now that we've got that, we can bring ProductScrapeEvent into crud, sync the table, and write the create_scrape_entry function, which is pretty cool.

Let's give it a shot with the original data. Open Python and import a few things: from app import crud, models. Define the scrape event model as models.ProductScrapeEvent, the data as models.data, and then run crud.create_scrape_entry(data). Right now it says something is missing: the uuid column. I actually never set it, which is probably not surprising, because I'm just passing in that default data. For a UUID field like this, we need to set it right before we create the item, so I'll set data["uuid"] to a value from uuid.uuid1(). uuid1 includes a timestamp, which is why we want to use it: it gives me some sort of ordering as well, which is nice. So now we have the ability to create this data; let's try again. Do all of the exact same imports, give it a moment to load, build the scrape event model and the data, and finally run the scrape entry. If I keep doing this, what I should see is the UUIDs changing each time, which is pretty cool. The other thing is I can actually see this by iterating through it: for obj in ProductScrapeEvent.objects.all(), print out obj.uuid and maybe obj.asin. They're all different; it doesn't look like they're that much different at a glance, but at the very beginning of each UUID you'll see that they are, and they all carry that testing ASIN, so it really is storing a row as if each one were a scrape event.

So what I actually want to do is add one more crud function, and this time I'll call it add_scrape_event. This one just calls each of the others: the product is equal to create_entry(data), then the scrape object is equal to create_scrape_entry(data) (calling it scrape_obj so we don't confuse it with the name of that method), again passing in the data, and then return the product and the scrape object. Save that and come back to the shell. Now I'm going to run this all over again, this time hard-coding the dictionary into the call: an ASIN of "testing123" and a title of "hello world".
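Putting those pieces together, here's a hedged sketch of the scrape-event side of app/models.py and app/crud.py at this point; field choices like the price_string default on the event table are assumptions, and create_entry is unchanged from the earlier sketch.

```python
# app/models.py -- the event table: UUID primary key, ASIN indexed for lookups
from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model


class ProductScrapeEvent(Model):
    __keyspace__ = "scraper_app"
    uuid = columns.UUID(primary_key=True)
    asin = columns.Text(index=True)           # index lets us filter events by ASIN
    title = columns.Text()
    price_string = columns.Text(default="-100")
```

```python
# app/crud.py -- scrape-event helpers (create_entry is defined earlier in this module)
import uuid

from cassandra.cqlengine.management import sync_table
from .models import Product, ProductScrapeEvent

sync_table(Product)
sync_table(ProductScrapeEvent)


def create_scrape_entry(data: dict):
    data["uuid"] = uuid.uuid1()               # uuid1 embeds a timestamp -> rough ordering
    return ProductScrapeEvent.create(**data)


def add_scrape_event(data: dict):
    product = create_entry(data)              # upserts the Product row by ASIN
    scrape_obj = create_scrape_entry(data)    # always adds a new event row
    return product, scrape_obj
```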
Okay, here we go, and we run this a few times. We should be getting different scrape events as well as different products. Wait, we only called the scrape entry, so let's try that again with add_scrape_event instead of create_scrape_entry; now we should be getting both things in our database, and there we go. On one hand, the product scrape events keep incrementing; the product itself, that original model, is not incrementing, but it will change if we need it to: if I change the title to "hello worlder", the product changes right there, while the existing scrape events don't. And if I actually include a price, say 32 with a dollar sign in front (which is why I put price_string in the model), and run something like that, we've got a price in here. Oh, the key should actually be price_string; we definitely need to validate this data a little better, but for now we can see that I can have all of these events coming in. Notice, too, that this is working really fast considering that the actual Cassandra session is not on my local computer, and yours would be going just as fast, if not faster, which is pretty cool. We can always look at these product scrape events now by iterating: for obj in models.ProductScrapeEvent.objects.all(), print out the obj itself, and it's a lot; I just went through a lot.

Cool, so that's running these different Cassandra entries; it's pretty straightforward how we do it. The reason for this design has to do with a couple of things. First, this will feed my detail view (the list view will feed into the detail view in our web application), and the scrape events will pretty much only appear on the detail view for an ASIN, because whenever we do a scrape event, our product stores the ASIN, which is that primary key, the main thing we'll always want. Once we understand that part, we can start having scrape events happen on that primary product, which I think is also pretty awesome. Now, this is a bit tedious to write out in the Python shell like this; I totally understand that, so in the next part we're going to use a Jupyter notebook to do all of these same things in a way that I think is a lot more practical for our project in general.

Now we're going to start using Jupyter for managing a lot of our experiments with our code. Writing everything in the Python shell, like we have been, works, but a lot of the time we forget what we wrote, things get lost in that context, and we often have to exit out of the shell and rerun things. If you've never used Jupyter before, it's really easy to use and it's all based in Python, so I'll show you everything that's going on in there as well. To start it up, run jupyter notebook. This is going to be really good for our Cassandra models, but it's also going to be really good for when we go to scrape our web pages. It should have opened localhost:8888 for you; if it didn't, just click on one of the links it prints, or copy and paste one, and it'll open up just like this. Now, create a new folder with New and then Folder (granted, if you did this inside VS Code that would have been fine too), and rename the folder to nbs, for notebooks. Inside it, create a new Python 3 notebook (New, then Python 3 or the IPython kernel) and rename it "Working with Cassandra models".

The goal here is to work with what I've already created inside this app folder. As it stands right now that's fine; I can totally work from it, but the problem is where the notebooks are located. It's not like I can just do from app import crud, models: if I run that, it's going to say no module named app, because of where the notebook lives. So I need to change the working directory, and we can do that with cd from the notebook, and there we go. I've been running these cells really fast because I use Jupyter all the time: to run a cell, press Shift+Enter. Once you do Shift+Enter it runs that cell, much like pressing Enter in the Python shell, but inside any given cell you can press Enter as much as you like and write out multiple lines. Shift+Enter is going to be your best friend.

In this case, what I want to do is run that example of adding a scrape event again. Thinking back to it, we had our data in models.data, which should give us that original data dictionary, and from there I can call crud.add_scrape_event(data). Something that becomes pretty obvious: if I look at the data again and run it again, I see there's now a uuid field in the dictionary. That would have been uncovered in the shell as well, but as you hopefully notice, it's happening here and it's a lot easier to see. The reason it's happening is that create_scrape_entry adds a uuid field into that data dict, so if I try to run this again I get an error. What I can do is pass in an argument, let's say fresh: I'll call it with fresh=True, give the function a default of fresh=False, and if it is fresh, do what's called a deep copy, which is really a simple way to duplicate a dictionary. Import copy up at the top, and in that condition set data = copy.deepcopy(data), so it isn't necessarily changing my original data. Every once in a while you want this kind of enforcement: you want to prevent the pre-existing, already-scraped event data from coming back in. So let's restart the kernel (the other nice thing about Jupyter notebooks is that they auto-save, but you can also save with Cmd-S or Ctrl-S depending on your system), and run this again; it's going to run the scrape event. Oops, I needed to save my crud file, I don't know if I did, so Restart and Run All again, and now it runs the scrape event with the data being fresh, meaning the data dictionary does not have a uuid now, and we can run this as many times as we want for all of that data. The other cool thing is that I can now iterate through everything that's being stored.
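In code, the fresh flag described here might look something like this; it's a sketch of the crud helper above, and the extra uuid cleanup is an assumption rather than something shown in the video.

```python
# app/crud.py -- add_scrape_event with a "fresh" option
# (create_entry and create_scrape_entry are defined earlier in this module)
import copy


def add_scrape_event(data: dict, fresh: bool = False):
    if fresh:
        # work on a copy so the caller's dictionary is not mutated
        # (create_scrape_entry injects a uuid into the dict it receives)
        data = copy.deepcopy(data)
        data.pop("uuid", None)   # assumption: also drop any uuid left from a previous run
    product = create_entry(data)
    scrape_obj = create_scrape_entry(data)
    return product, scrape_obj
```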
For obj in models.Product.objects.all() I can print out obj.asin, and I can grab one of those to use: set asin = None first, then set it to one of the printed values. Then: for obj in models.ProductScrapeEvent.objects.filter(asin=asin), and, as long as asin is not None, print out obj.asin and maybe obj.uuid. Right now, hey, it's empty. Run it again, still empty; that's odd, or maybe this ASIN just doesn't have any events. Let's check by looking at the scrape events for that ASIN; yeah, it doesn't have any. So let's get one that does, which is testing123, and set it directly instead of iterating through like this; this one should have a lot, and there we go. We can continuously run these, so I can keep running a bunch of scrape events over and over again and then take a look at everything that's being stored, which is pretty cool. Something else you'll commonly see is declaring the queryset as a variable, something like q or qs (I use qs most of the time, but often you'll see q), and then enumerating it so you can see the iteration index for each item, a quick way to get a count of how many things have come through. And of course we can keep running it; it doesn't really matter at this point, we'll continue to get more scrape events while the product itself doesn't really change, which is pretty cool.

The other thing is that you might want to see any given query written out in CQL. We can do this by printing out what that query is, and here it is right there, which means I can actually run it as CQL. Let's do that: import the db module so we have our session (make sure you do that), then session.execute() with that query string, which gives us a result; for row in that execution, print out the row, and that gives me the data. This one, of course, is against the product table, which is really nice. Obviously CQL can do a lot more than a query like this, but at the very least we now see two ways to go about using any given item. The other thing is that printing the queryset isn't the only way to find out what the query itself is. Another thing I do all the time is look at the dir() of an object, and in there you can see an _execute_query, and there should also be an _select_query, plus something for the different columns and all sorts of really cool stuff. So let's print out q._select_query; it's a bound method, so call it, and there's that same query, just another way to grab it. We also saw a timestamp method in there, another bound method; calling it complains about a missing positional timestamp argument, and I'm not going to jump into that. The main thing was really just looking at _select_query, so you can see exactly what you might need to run at any given time for your execution.
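A small sketch of those two approaches side by side: the queryset filter and a raw CQL execute. The ASIN value and keyspace follow this section, and _select_query is an internal cqlengine helper shown in the video, so treat it as inspection only.

```python
from app import models
from app.db import get_session

# cqlengine queryset: filter scrape events by the indexed ASIN column
qs = models.ProductScrapeEvent.objects.filter(asin="testing123")
for i, obj in enumerate(qs):
    print(i, obj.asin, obj.uuid)

# peek at the CQL the queryset would run (internal helper, subject to change)
print(qs._select_query())

# or run CQL directly through the driver session
session = get_session()
for row in session.execute("SELECT * FROM scraper_app.product"):
    print(row)
```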
In fact, if I assign that CQL to a variable, turn it back into a string, and run it through the session, that works too; yet another way to go about it, and you can print it out as you see fit. Hopefully this highlights why I like Jupyter notebooks: not just for working with Cassandra models, but for rapid iteration and rapid testing in general. There are other query set methods I want you to try on your own; if you go into the documentation, which I'll link at the top of the GitHub repo, and scroll down a bit, you'll see a bunch of methods available on every model in case you want something more advanced or want to add extra features. We're going to stick with fairly straightforward, simple things at this point. What I want to do now is add an extra layer of validation before anything ever gets to the point of being written to the database. In other words, the Cassandra models do have some validation built in, but I want one more step, so that when I implement the FastAPI application I'll have more robust, FastAPI-ready validation for any given view or input. We're going to use pydantic to make sure the data that gets saved through our Cassandra models is the data we actually want. So let's duplicate this notebook and rename it to "Cassandra and Pydantic". I'll leave the session setup and the imports in place and build a pydantic model on top: from pydantic import BaseModel, then a ProductSchema class based on the product model, where asin is a string, title is a string, and that's it. Now I can take that same data and say product_obj = ProductSchema(**data), then look at product_obj.dict(); that dictionary is what we'll actually pass into the Cassandra models. You might wonder what the point is, since this data is already valid for this particular model. The point shows up when I pass in some other data, say abc123 set to nothing: that key no longer comes through, because it isn't declared on the schema, unless of course I add abc123 as a str on the class, in which case it does. So it quietly cleans out data I don't care about, with no errors. And if the data actually has a problem, I get a ValidationError with a much more useful message. To me the win is that it strips out data that isn't necessary, and it lets me change what I want to require on the fly in pydantic instead of needing to change something in my Cassandra database.
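A minimal sketch of that notebook step, assuming pydantic v1; the asin and title values here are just placeholder data:

```python
from pydantic import BaseModel, ValidationError


class ProductSchema(BaseModel):
    asin: str
    title: str


data = {"asin": "B01234ABCD", "title": "Example product", "abc123": "ignored"}

# Extra keys that aren't declared on the schema are silently dropped.
product_obj = ProductSchema(**data)
print(product_obj.dict())  # {'asin': 'B01234ABCD', 'title': 'Example product'}

# Invalid data raises a ValidationError with a readable message.
try:
    ProductSchema(**{"title": "missing asin"})
except ValidationError as e:
    print(e.json())
```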
The other reason to use pydantic is that if you're converting an old FastAPI project to run on a Cassandra database, which I recommend you can do, it's going to be really easy based on what we're about to do, and what we're about to do is basically what we've already seen: passing that scraped data dictionary in and creating the model entry from it. If you hit an "invalid columns" error here, it just means the earlier cells haven't all been re-run; run all three of them and it goes away. So that's the basic model for the product itself. I don't usually call this a model, though; I call it a schema, because it isn't what actually stores my data. Pydantic calls it a base model, I call it a schema, so let's jump back into the app and create schema.py, import pydantic, and put this ProductSchema there. I also want the same thing for the scrape event, so class ProductScrapeEventSchema(BaseModel), based on the Cassandra model, which means we need a uuid field of the UUID class: from uuid import UUID. And if you recall, the title was an optional value, which is easy to handle: from typing import Optional, then wrap the type with Optional. Now we can use this schema data. Restart the kernel, then call schema.ProductScrapeEventSchema and unpack the data into it. What we should get is a "field required" error for uuid, and to format that better we can wrap it in a try block and handle the exception, which is a pydantic ValidationError: from pydantic import ValidationError, except it as e, and print e.json(), which gives a much clearer picture of what's going on; the uuid is missing. So how would I create the uuid for this one? Exactly as we've seen before: set the uuid field to uuid.uuid1(), importing the uuid module, and make a copy of the data dictionary with dict() so we don't mutate the original. This time there's no validation error, and it's a really nice, clean way to ensure the data coming through is correct; we should probably write a handful of tests to lock that in. One thing to keep in mind: if we ever need to add new columns, we now have to do it in two places, once where we validate this data and once on the Cassandra model where we actually store it.
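Here is roughly what app/schema.py looks like at this point, a sketch assuming pydantic v1 and the field names used on the Cassandra models in this project:

```python
# app/schema.py
from typing import Optional
from uuid import UUID

from pydantic import BaseModel


class ProductSchema(BaseModel):
    asin: str
    title: Optional[str] = None


class ProductScrapeEventSchema(BaseModel):
    asin: str
    title: Optional[str] = None
    uuid: UUID
```

In a notebook, schema.ProductScrapeEventSchema(**dict(data, uuid=uuid.uuid1())) then validates cleanly, while leaving uuid out raises the ValidationError described above.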
Now, the Cassandra driver itself can validate data, but it doesn't give you anything like e.json(), and it doesn't plug directly into FastAPI; FastAPI was designed to use pydantic for validation in any given response or view. So we're actually pretty well set up to start building our views and see what this looks like. Next we'll implement the FastAPI application, at least a basic one, along with the environment variables. Inside the app package create main.py: from fastapi import FastAPI, declare the app with app = FastAPI(), then add @app.get("/") on a read_index function that returns a dictionary of hello world. If you've never used FastAPI before, that's it; that is a web endpoint right there. To run it, open a terminal with your virtual environment activated (Jupyter can keep running if you like) and run uvicorn app.main:app. Uvicorn is the ASGI web server we're going to use; app is this folder, main is this Python module, and the final app is the variable I configured, so if I had called it abc I'd change the command accordingly. I'll also pass --reload, and that spins up a local web server where I can open the page and see hello world.
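A minimal version of that first main.py, as described:

```python
# app/main.py
from fastapi import FastAPI

app = FastAPI()


@app.get("/")
def read_index():
    return {"hello": "world"}
```

From the project root, uvicorn app.main:app --reload serves it locally (http://127.0.0.1:8000 by default).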
That's fine and all, but we still need the environment variables. The way I had it set up before was loading a .env file directly, and that doesn't really work in this case, which is another reason to have pydantic installed: it has a really easy way to use environment variables inside a FastAPI project. So inside the app package make a config file, config.py, and import a couple of things from pydantic: BaseSettings and the Field class. Declare a class called Settings that takes BaseSettings. The settings I definitely need are for the database, the client ID and the client secret; the only other thing I might consider grabbing is the keyspace, if I wanted that to be an environment variable too, which is completely up to you. I'll keep it nice and simple: db_client_id as a string and db_client_secret as a string. The reason I imported Field is so I can map each of these to the names I actually use in my environment variables, the client ID and client secret values, so each one becomes equal to Field(..., env=...) with the matching variable name; the ellipsis means the field is required, with no default. Watch that the first one really does map to the client ID variable; I had to double-check that. Those are the baseline settings, and the next thing is an inner Config class with env_file set to ".env". Where it looks for that file depends on where you run uvicorn from: if I ran it from inside the app folder as just main:app, the .env file would need to live inside app instead of the root of the project, but I'm leaving it at the root. Next we need a way to get these settings, so define a function called get_settings that returns an instance of this Settings class. One thing we don't want is for that to run every single time it's called, and Python has a built-in way to prevent that: from functools import lru_cache, and decorate get_settings with it so the settings are effectively kept in memory and only built once, at the beginning, which is really the only time we need it; these environment variables shouldn't change on the fly very often. I will say it's possible that one particular environment variable isn't set in your environment, so I'm also importing os and, if that variable isn't there, setting it to "1" (as in true), just as a safeguard to make sure it's on. While I'm here I'll add one more setting for the project name, something like "fastapi astra", so name is a string equal to Field(..., env="PROJECT_NAME"). Now let's grab these settings inside the project: in main.py, do from . import config and then settings = config.get_settings(). If you ended up calling your module settings, which you totally can, just make sure you're not importing settings and then reassigning settings everywhere; that gets confusing fast. With that in place I can use settings throughout the application, for example returning "name": settings.name from the index view, just a really simple example of grabbing one of the settings items. I also need to fix a typo, functools rather than functool, and name should be declared as a string. This is exactly why we run with --reload: it rechecks all of our code automatically, and when we refresh, the name shows up, so it's probably fair to say all of the secret keys and environment variables are working correctly.
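Here is a sketch of app/config.py at this stage, assuming pydantic v1; the exact environment variable names (ASTRA_DB_CLIENT_ID, ASTRA_DB_CLIENT_SECRET, PROJECT_NAME) are placeholders for whatever your .env file actually uses, so adjust them to match:

```python
# app/config.py
from functools import lru_cache

from pydantic import BaseSettings, Field


class Settings(BaseSettings):
    # Field(...) marks the setting as required; env= maps it to the name
    # of the variable in the environment / .env file.
    name: str = Field(..., env="PROJECT_NAME")
    db_client_id: str = Field(..., env="ASTRA_DB_CLIENT_ID")
    db_client_secret: str = Field(..., env="ASTRA_DB_CLIENT_SECRET")

    class Config:
        env_file = ".env"


@lru_cache()
def get_settings():
    # Cached so the .env file is only parsed the first time this is called.
    return Settings()
```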
That of course means our database file, db.py, will likely have to change at least slightly. Inside db.py, instead of load_dotenv I'll use that configuration file again: from . import config, then settings = config.get_settings(). Notice I'm calling it again, but since it's wrapped in lru_cache it won't actually run again if it already ran the first time. From there I could keep using the environment variables directly, or I can simply use settings.db_client_id and settings.db_client_secret; we just want to make sure those names match exactly what's in the Settings class, and we'll find out in a moment. I'll get rid of load_dotenv now and save everything. At this point let's jump back into Jupyter. If you did need to change environment variables, when in doubt close down your Jupyter notebook, deactivate your virtual environment, reactivate it, and run jupyter notebook again; that's often a clean way to flush out stale environment variables. Opening one of the notebooks to verify the session, I'm getting an error that there's no attribute called db_client; clearly I made a mistake and wrote db_client.id when it should be db_client_id. Simple enough: fix it, restart and run all, no big errors so far, and sure enough it still works. We certainly could have done it this way from the get-go, but it only really makes sense once we bring it into FastAPI; the environment variables still come through the dotenv machinery, it's just working through pydantic now to make that happen. So this is all good, and I shouldn't have to change a whole lot else. The next thing is going back into main.py and configuring a new kind of route, really an event handler, for FastAPI. I'm going to import the db module, declare session = None at module level, and then use @app.on_event("startup") on a function called on_startup. The reason for this is to have a running session of the Cassandra driver so we can actually access AstraDB: inside it, declare global session and set session = db.get_session(). In this case db is the database module; you could also do from .db import get_session and call that directly if you run into any namespacing issues. That gets our session, which is nice. There's one more thing, something we've already seen inside the crud file: the sync_table call. We do still want to sync the tables, just not inside crud.py; we want to do it in main.py. So I'll comment those lines out in crud (which will actually create a new problem for me somewhere, and I'll explain that later), drop the session from that file altogether, and leave crud as literally just the model-related logic, nothing more, and no more sync_table. Back in main, bring in sync_table and the models module, then run sync_table(models.Product) and sync_table(models.ProductScrapeEvent). So far we have a session, and then we sync the tables against it.
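A sketch of how main.py looks after moving the table syncing into the startup handler; db.get_session and the models module are the pieces built earlier in this project:

```python
# app/main.py (continued)
from cassandra.cqlengine.management import sync_table
from fastapi import FastAPI

from . import config, db, models

settings = config.get_settings()
app = FastAPI()
session = None


@app.on_event("startup")
def on_startup():
    # Open one driver session when the app boots and sync the tables once.
    global session
    session = db.get_session()
    sync_table(models.Product)
    sync_table(models.ProductScrapeEvent)
```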
That session, of course, registers a connection and sets the default connection. Now what I want to do is respond with a query set of some kind, so to keep things nice and simple I'll add a products list view on the /products route. I want to use the product model itself, and rather than referencing models.Product everywhere I'll redeclare it as Product = models.Product to save myself some typing; you could obviously import each model individually instead, it's up to you how you design that. In the view, results = list(Product.objects.all()), which is really close to what we had before. Going back to the running FastAPI app, open /products in a new window and it actually gives us all of those results, which is fantastic. But what if I only want to show the asin and title? This is where the schemas come in: I have ProductSchema with just asin and title, no price string, and I'll add one more schema called ProductListSchema, since I may want to bring the actual price string back for lists based on the model itself; it would make sense to have it anyway. Back in the view, that's the shape I want the results to take, so import the schema module, and from typing import List. On the route decorator I can add response_model and set it equal to a list of that schema, List[schema.ProductListSchema]. With that in place the response should come back close to the schema shape, but not if we return it wrapped in a dictionary the way it's currently set up; return the list of items directly, otherwise we'll probably get an error. After also fixing a not-defined error by referencing schema.ProductListSchema properly, run it again, refresh, and it brings the products back shaped by the schema. At any time I can jump into the schema and add price_str back as an optional string, save, refresh, and now it's not just validating that data like we've already seen, it's also controlling what gets displayed, instead of displaying literally everything that's in the model. Yet another reason to use pydantic: we used it for validating, and now also for serializing the data we want to display.
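A sketch of the list view as described; ProductListSchema here is the pydantic schema declaring asin, title, and an optional price_str:

```python
# app/main.py (continued)
from typing import List

from . import schema

# Short aliases so the views don't spell out models.* everywhere.
Product = models.Product
ProductScrapeEvent = models.ProductScrapeEvent


@app.get("/products", response_model=List[schema.ProductListSchema])
def products_list_view():
    # response_model trims each row down to the fields declared on the schema.
    return list(Product.objects.all())
```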
All of which I think is pretty neat. Now I want to take this a step further with the scrape events. I'll add a ProductScrapeEventDetailSchema, pretty similar to the event schema we just made, but this time with no uuid. In other words, I want a new view for an individual product, looked up by asin. Copy the list view, paste it below, drop the response_model for now, and add asin to the path. The way this works is very similar to string substitution, except it's really string extraction: whatever value is in that part of the URL gets passed into the function as the asin argument, so this becomes our product detail view. One option is to just return the product itself, Product.objects.get(asin=asin); that's not a list anymore, so we can return it as a dictionary. Save it, make sure the server is running, grab one of the asins, hit enter, and there's a detail view for that product with all of its data. So I'll assign that to data and return it, and I'll also add one more key called events: again using a list, this time from ProductScrapeEvent.objects.filter(asin=asin), which should give me a list of all of those product events, and refreshing shows exactly that. But I don't want the raw uuid in there, so pull the query out into events and run each one through the schema, building the list by applying that schema to each event. Refreshing gives a server error because the schema module wasn't imported in that expression; prefix it with schema. and refresh again, and we have the data exactly as we wanted to see it. The other nice part is that if we decide we want the price string all around, we just change the schema; we can adjust these things on the fly, which is exactly why I like using pydantic for rendering responses like this. Going forward I'll be able to see every one of these scrape events. There is still one thing I need to solve, though: if I open the Cassandra models notebook and restart and run all, in theory we'd hit an issue trying to use one of these data models. The only reason it's working right now is likely that a session is already running; if it weren't, we'd need to go into the main project and call those startup pieces ourselves.
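A sketch of the detail view; it assumes the cqlengine model instances can be unpacked into dicts the way the walkthrough does, and that ProductScrapeEventDetailSchema is the event schema without the raw uuid:

```python
# app/main.py (continued)
@app.get("/products/{asin}")
def product_detail_view(asin: str):
    data = dict(Product.objects.get(asin=asin))
    events = list(ProductScrapeEvent.objects.filter(asin=asin))
    # Re-validate each event through the schema so only the declared fields
    # (no raw uuid) end up in the response.
    data["events"] = [
        schema.ProductScrapeEventDetailSchema(**dict(x)) for x in events
    ]
    return data
```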
So let's give that a shot: we'll bring in main, get our session, and call main.on_startup(), and see whether that runs those pieces for us; if it doesn't, we'd just call them ourselves every time we run in here (and watch the spelling, it's main.on_startup). That should run sync_table correctly, so now when I run these events they should work just fine. It's certainly possible they were already working because a session happened to be up, but if I took FastAPI down there's a good chance they wouldn't be; either way, we should now see quite a few more events. The last thing about these events is that I don't see a timestamp anywhere, so what I want to do now is use the pydantic schema to surface the time associated with that uuid1 value. Looking at all of the scrape events, there's one critical flaw in the current data: it's missing when each scrape actually occurred. Did it happen a week ago, a year ago, ten seconds ago? We have no idea. Well, actually we do: if we look at the model, the uuid field uses uuid.uuid1(), and a version 1 UUID has a time element baked in. The Cassandra driver also offers a TimeUUID column type, and in our case the two are essentially equivalent because we're creating the uuid ourselves, so by all means use either one. Here I'm going to parse the uuid field into an actual datetime object. There's a gist I created specifically to handle this: on gist.github.com, under codingforentrepreneurs, you can find the "Cassandra Time UUID to Python datetime" markdown file, and the snippet at the very bottom is all we need; it's really simple. Grab it, and rather than dropping it into the schema, create it as a utility function: add utils.py and paste it in. Even if you pause right now you could probably get the gist of it: it starts from a datetime of October 15, 1582, which is a slightly strange anchor date with an interesting article behind it, so by all means check that out, and then it adds the uuid's timestamp, divided by 10 to turn its 100-nanosecond units into microseconds.
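The conversion boils down to something like this; the function name mirrors how it's used later in the walkthrough:

```python
# app/utils.py
import datetime


def uuid1_time_to_datetime(time: int) -> datetime.datetime:
    # uuid1 timestamps count 100-nanosecond intervals since 1582-10-15
    # (the start of the Gregorian calendar), so divide by 10 for microseconds.
    return datetime.datetime(1582, 10, 15) + datetime.timedelta(microseconds=time // 10)
```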
So this is the way to convert a uuid1 time into a datetime object, and the gist shows an example right there; it's not really that complicated. The part that might be new to you is augmenting one of our schema fields with it. I'm going to import one more thing from typing, the Any type, because of how I want to represent this value: on the ProductScrapeEventDetailSchema, add created as an Optional[Any] defaulting to None. With that, I can use what's called a root validator in pydantic, root_validator, which lets me enrich my values. Using the decorator with pre=True, define a method like extra_create_time_from_uuid; essentially I'm assuming the time embedded in the uuid is the created time, which in my case it certainly is, though if you generated your uuids differently it might not be. The method takes cls and values, the standard signature for the root_validator decorator, and all I need to do is add the key-value pair I want: set values["created"], starting with just values["uuid"], and return values. All this does is attach an extra, arbitrary field to the validated data, which we can verify in the response, and there it is, except right now it's just the raw uuid. So we swap in our new utility function: the uuid object exposes .time, and I'm quite confident the uuid will never be None here because it's the primary key coming straight from the model; if it could be missing, I'd add some conditions around this. Bring in the utility with from . import utils and wrap the value as utils.uuid1_time_to_datetime, passing in that time. Refresh, and there's your datetime object. You could also call .timestamp() on the datetime (note the spelling) if you'd rather expose a numeric timestamp as the value, so you can pick whichever representation you want. You will notice responses take a little longer now; enriching the data like this takes time because every event gets parsed on the way out. The next piece is back in main, in the detail view where I'm fetching the events: I want to limit how many items come back, which is as simple as chaining .limit() onto the query set with however many you want to show. Setting it to five and refreshing now gives me the most recent five.
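Here's what the detail schema ends up looking like with that validator, a sketch assuming pydantic v1 and the utils module above:

```python
# app/schema.py (addition)
from typing import Any, Optional

from pydantic import BaseModel, root_validator

from . import utils


class ProductScrapeEventDetailSchema(BaseModel):
    asin: str
    title: Optional[str] = None
    price_str: Optional[str] = None
    created: Optional[Any] = None

    @root_validator(pre=True)
    def extra_create_time_from_uuid(cls, values):
        # The uuid1 primary key embeds the creation time; expose it as "created".
        uuid_val = values.get("uuid")
        if uuid_val is not None:
            values["created"] = utils.uuid1_time_to_datetime(uuid_val.time)
        return values
```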
To take this further, I'd add one more view at /products/{asin}/events, and this time use the response_model again as a list, with the scrape event detail schema as the response type. We still get a list back, but now we could work toward actual pagination; I'm not going to do that right now, but this scrape events list view is how we'd expose a lot more data about any given event, and with pagination we'd be able to limit each page going forward. This view is quite a bit simpler than the detail view above, but it's nice to know we can do both. At that point I'd also add an events_url field to the product detail response, set to basically that path combined with whatever domain you end up using, so that on the actual detail page the client knows where to find the events; collapse that down and there's our events URL, which is pretty nice. If we wanted to extend this, one thing to consider, in all honesty, is not converting the uuid on the fly like this at all, but storing the timestamp at the beginning, as another field on the model, so that when you perform the create you write it alongside everything else. That's certainly more efficient than what I just did, because the value lives in the data itself, but the reason I walked through all of this was to show the steps of enriching your data, and in theory you could also enrich the product scrape events from the product schema itself; I don't want to spend the time doing that. The idea is that we adjusted a pydantic base model, the schema we created, used a utility function, and touched something very common inside a Cassandra database. Now we're going to create an endpoint to ingest data, so that pretty much any source can scrape this data and send it to my FastAPI project, which will send it on to AstraDB, our Cassandra database. What we want is really simple: the schema we need is the list schema, which corresponds very well with our crud action for adding a scrape event, and then we want an endpoint for it. It can look a lot like the detail route, but it doesn't need to share the same path; I'll just call it events/scrape. I don't need a response model here, but I will still type the incoming data with that schema, this becomes the events scrape create view, and it calls into crud, so that module gets imported as well; see the sketch below.
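A sketch of both additions to main.py; crud.add_scrape_event is the helper built earlier in this project, assumed here to return the product instance and the new scrape event instance:

```python
# app/main.py (continued)
from . import crud


@app.get(
    "/products/{asin}/events",
    response_model=List[schema.ProductScrapeEventDetailSchema],
)
def product_scrape_events_list_view(asin: str):
    # A paginated version would limit/offset here; for now return them all.
    return list(ProductScrapeEvent.objects.filter(asin=asin))


@app.post("/events/scrape")
def events_scrape_create_view(data: schema.ProductListSchema):
    product, _ = crud.add_scrape_event(data.dict())
    return product
```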
Inside the view we call crud.add_scrape_event, passing in data.dict(), the validated dictionary, which then gets stored through the Cassandra models. What comes back is a product object and a scrape event object, both of which we could still use and roughly return together, but I'm just going to return the single product, which matches the product list schema shape, and leave the scrape event object as an underscore since I'm not using it. Now that the endpoint exists it's time to test it. There shouldn't be any errors on the FastAPI side, and my Jupyter notebook server is still running; if yours isn't, by all means start a new one. Inside the notebooks folder create a new notebook, something like "send data to scrape event endpoint", which will go against localhost. First import requests so we can actually send an HTTP request; if you don't have it, uncomment and run pip install requests in a separate cell. The endpoint is the events/scrape path on the local host, and after fixing the host portion of the URL we've got our endpoint. Let's call it as-is first: requests.post to that endpoint with json set to an empty dictionary, and print r.json(). It comes back with a "value missing" validation error; we're missing the body, and the asin is the main thing that's missing. So let's add some data: an asin of httptest, a title of abc123, and a proper-looking price string. Post it again and there's the stored product coming back, and if I do it again and again it's always that same data. Still using requests, we can look that data up by going to /products/ plus the asin we just used, so set up a second endpoint variable, r2 = requests.get on it, and print r2.json(), after fixing the asin variable so it matches the one from above (which would be nicer declared up top anyway). Run it and that same data comes back. The product itself won't really grow as I keep posting, but the events view, which I did not paginate, should give me all of them and just keep growing, or at least it should; if it stops at a handful, maybe I never removed that limit. Sure enough, I did leave the limit in, so I'll get rid of it, and I don't even have to restart anything for that change to apply.
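Roughly what that test notebook does; the asin, title, and price values are just placeholders:

```python
# notebook: send scrape data to the running API and read it back
import requests

base_url = "http://localhost:8000"

data = {"asin": "httptest", "title": "abc123", "price_str": "$12.99"}
r = requests.post(f"{base_url}/events/scrape", json=data)
print(r.json())   # posting json={} instead returns a validation error

r2 = requests.get(f"{base_url}/products/{data['asin']}")
print(r2.json())  # the product detail view, including its scrape events
```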
Now when I refresh the GET request I see everything coming through. I mentioned you could do this from anywhere, and that's really the point, so to show it there's a tool called ngrok. Go to ngrok.com, download it, open another process on your machine, and run ngrok http followed by the port FastAPI is running on; in my case that's port 8000, which I can confirm by re-running uvicorn or just looking at the URL. ngrok then lists a public base URL, one http and one https, and either works. Open it and you'll see hello world and all of our FastAPI routes. Back in my Jupyter notebook I can set base_url to that ngrok URL, set path to events/scrape, build the endpoint by combining the two, and then post data against it over and over and run the quick lookup again, and we'll see a lot more data. The best way to tell is to print the length of the events list: it's at 16, and a couple more runs puts it at 20. So that's a really simple and easy way to ingest new data through FastAPI. And to me the point isn't our local Jupyter notebook but something like Colab: if something seems off, I can open a notebook there (we'll talk about doing the proper scraping later), import requests, and go. Colab is itself a Jupyter notebook and runs all the same things as your local one; a lot of packages are already installed, you can pip install anything else, and you can even use apt-get for Linux-level installs, for things like what Selenium needs, which is handy. Because ngrok gives a real public endpoint, out on the open internet, I can run the same request from Colab, after fixing the asin I forgot to carry over into that cell, again and again: it goes from a live Colab notebook through ngrok to my local server. A couple of caveats: ngrok is reasonably secure, but probably not the most secure setup possible, and every time you shut it down and re-run it you get a different URL, so it's not a long-term solution; you'll definitely need to update your ngrok endpoint each time. One way to handle that is an input() prompt asking for the current ngrok URL and pasting it in, but I'm not going to do that, and I'm not going to keep using ngrok in the long run at all; I only meant to show that it can be done and that it's an effective way to exercise our FastAPI app.
Yet another reason to have FastAPI fronting a lot of our Cassandra database work is that it gives us an endpoint we can use in many ways: not just to ingest data, but to view it, to drive scraping, to inspect it, everything we need to do. Of course this is only part of the picture. At some point, though not in this video, we'll want to deploy this to a production server, and the reason we aren't doing production in this series is related to the connect.zip bundle: there is a way to encrypt it and ship it to production securely, but it adds a lot of steps to something that doesn't need to happen right now, which is also why I wanted to show you ngrok. With all of this in place, we've accomplished pretty much everything we need for storing our data; the foundation is done, and future changes would mostly be adding fields to the models, updating the schemas accordingly, and maybe adding URLs. What we want to do now is implement Celery, so we have a worker process that runs for us on a regular basis. The logic goes something like this: you have a function, say hello world, and when you want it to run, you just call it. But what if you want it executed in the future, whether that's in one second, a week, or a year? And what if you want this specific function to run every Monday at 9:00 am? That is exactly what we're going to use Celery for. Python Celery lets you both delay tasks and schedule them, and the cool thing about delayed tasks is that they can run on a multitude of computers: you can have a huge cluster of machines working through tasks for you instead of your main web application computer or the one you're currently on, provided the code is deployed and set up correctly. We're not going to build anything that complex yet, but that's the direction. Celery itself relies on communicating through messages via a message broker, which is why we're going to use Redis. Think of Redis as a big key-value store, basically a huge dictionary; Celery will insert keys and delete them on its own, without our intervention, but we definitely need Redis to work, otherwise Celery won't work. So at this point I recommend setting up Redis on some machine, whether that's your local machine or a remote host. In the setup folder of our repo there's an entire guide for setting up Redis on a variety of machines, and if you just want something quick, create a virtual machine on some cloud host, run the handful of commands listed there, about five of them, and then ping it.
That's a really quick and easy way to get Redis going. If you're on a Mac you can use Homebrew to install it, but on Mac or Linux I actually recommend Docker; it's so much easier to just run it with Docker, for a couple of reasons I'll show you in a second. Windows users: if you can install Docker, use Docker; otherwise there's a tool called Memurai that behaves just like Redis, so you have a lot of options here. From here I'm going to assume you've already installed Redis on your system and can ping it. One way to ping it is directly from the command line, and the reason I like that way is that I can change the host right there in the command without relying on any other third-party tooling; depending on your setup, particularly on Windows, you may have to use redis-cli and its ping command instead. In my case my Redis server isn't actually up at that port, so I want to run Redis on Docker instead, which is really simple: docker run -it --rm redis, where -it runs it in interactive mode, --rm removes the container when it stops, and redis is the official image, and there it is. Now I can re-ping it, and I'm having issues; notice the port. This is why I really like Docker: my local version of Redis isn't working correctly, but I'm also not connecting to the Docker version yet. So close it out with Ctrl+C, press up, and this time map the ports: -p 6380:6380 to expose port 6380 to the container's internal port, then redis --port 6380 so the server listens there. Hit enter and there's the new port, and if I ping it again, simply using that different port, I get a response. I can also do this with redis-cli by changing the host or the port, so let's change the host to localhost this time and set the port to 6380.
Hit enter, and in this case it isn't pinging back with localhost, so let's put it at 127.0.0.1; that's also not pinging correctly, so let's change the order and put the flags before the ping, since localhost should have worked, and that was it. The real point here is simply that we need to be able to run these commands somehow; if you can't, the rest of this isn't going to matter at all. With that in mind I'm going to use this as my actual Redis URL, so let's jump back into the project, into the environment variables first, and put in your REDIS_URL. You can use localhost or an IP address; if you deploy Redis on some server, which I definitely recommend trying because it's really cool to see (you just install Docker, run it, and boom, it's ready, and you can delete that virtual machine at any time), that could be a DigitalOcean droplet, Linode, AWS, GCP, or any of the many places where you can get virtual machines. I'm going to leave my Redis URL as it is. Now I need to actually use this Redis URL. Notice I haven't configured anything related to Celery just yet; Celery itself doesn't need pydantic, I'm just using pydantic as the way to load environment variables, so I want to bring this Redis URL into my main configuration. In config.py I declare redis_url very similarly to before: a Field, required with the ellipsis, mapped to the REDIS_URL environment variable; that mapping is actually a bit redundant, since pydantic would match the uppercase name anyway, but it doesn't hurt. Cool, so we now have that setting. Since we're all set up with Redis, I shouldn't have to do anything else for it; we'll see as we get into Celery whether anything more is needed. I also want to stress that Redis needs to be running whenever I'm using the worker process, and I'll show you little strategies for making sure it is, or at least what to look for if it isn't. Save that and let's create our app, worker.py. I say app because this is a Celery application; it doesn't need anything else to run, and the only reason I pull in configuration like this is to load the environment variables; I could absolutely do that all over again individually for this worker process, I just don't need to. So: from celery import the Celery class, then declare the app as Celery with the name of the module. We could leave it named app, and I could actually use that here, but my primary application is the FastAPI one, which is also declared as app, and I don't want things to get confused, especially when it comes to the decorators we'll see in a moment, so instead of simply app we're going to use celery_app, which gives it some distinction. Now, because of how we define tasks, let's first define a function, call it random_task, and give it a name, or really some sort of argument.
It could be any argument, and we'll explain that in a moment; inside it we'll print "who throws a shoe? honestly" and then the name that's passed in. A silly, silly function, but now we want to turn it into a Celery task, and we use the decorator celery_app.task, or whatever your Celery app is named. You can see why the naming matters: if it were just app and app.task, you might look at the function, think of your primary app, and assume it's a FastAPI decorator, and it is not, it's a Celery decorator. So we're almost there, not quite, but almost. Typically you'll see something like redis_url = "redis://127.0.0.1:6379", because that's the default Redis port and host (sometimes written as localhost), and then celery_app.conf.broker_url set to that Redis URL; that's what lets Celery, at least for the worker process, know which broker we want to use. You'll also often see the result backend being stored in Redis as well; it could be completely different, a SQL database, Cassandra, lots of places you can put the results, but we're not going to worry about that right now. You might remember I made a configuration item specifically for this, so I'm going to use the same method I used in my primary app to get environment variables, and I'll use it in any other app that has access to the same code: from . import config, settings = config.get_settings(), and then the Redis URL is simply settings.redis_url, exactly what this is. Simple enough, I think. Now let's run it. FastAPI is running, my Jupyter notebook is also running, and now I'm going to run my Celery application; that's a lot of processes at this point. To do that you can call celery itself or python -m celery, either one gives you the exact same command, and --app and -A do the same thing too. So: celery --app app.worker.celery_app worker --loglevel=info, where app is this folder, worker is this Python file, and celery_app is the variable right there (which would of course change if you renamed it), then the worker command, and the log level flag (the same as -l) set to info gives me information about what's running. It's running, and notice the Redis URL in the output: exactly the Redis URL we configured in our environment variables, pointing at Docker. Now let's pretend for a second that you forgot to run Docker, so you close it out; in our worker, if we scroll down a bit, this is what happens: it loses the connection, connection refused. Run Redis back up and the worker regains the connection, which is just so cool, I think. All right, let's actually try the worker out with an example using this random_task. To do that I'll jump into my Jupyter notebooks and duplicate one of them, the "working with Cassandra and pydantic" notebook.
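Here is roughly what app/worker.py looks like at this point; the config pieces are the ones sketched earlier, and the printed message is just the joke from the walkthrough:

```python
# app/worker.py
from celery import Celery

from . import config

settings = config.get_settings()
REDIS_URL = settings.redis_url  # e.g. redis://127.0.0.1:6380

# Named celery_app (not app) so it isn't confused with the FastAPI app.
celery_app = Celery(__name__)
celery_app.conf.broker_url = REDIS_URL
celery_app.conf.result_backend = REDIS_URL  # optional; results could live elsewhere


@celery_app.task
def random_task(name):
    print(f"Who throws a shoe? Honestly, {name}.")
```

Start the worker from the project root with: celery --app app.worker.celery_app worker --loglevel=info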
That's mostly just for some of the boilerplate that's in there; rename it to "working with Celery basics". Jumping in, what we want is from app import worker; we don't need uuid or a lot of the other imports, so edit and delete those cells. The function name was simply random_task, and to make it really clear and easy to see what's going on I'll import random_task directly and pass in "justin". Run it (after making sure it's defined) and there we go: who throws a shoe? honestly, justin. How silly is that. Celery wasn't used at all there; that's just a standard Python function call. So how do we actually use Celery? We call .delay() on the task and pass in the arguments; it really is just that simple. Inside Jupyter we now see a task result, the actual task id, or rather the AsyncResult for it, which we could look up somewhere else if we needed to, and if I go into the Celery process, there is that printed statement. Let's do this several more times, say for i in range(10), delaying the task each iteration; look how fast that was, and if you scroll the worker output it actually shows us all of them. We can make it a little more explicit which iteration is which by passing i along, and then we see each of those task iterations coming through, which is pretty cool; that's a simple way to delay it. There is another way, and it's called apply_async. With apply_async we pass the arguments explicitly as args, which should be in a tuple, and we can also pass options like countdown, maybe 10 seconds, which means it's going to wait ten seconds to execute; in fact delay is essentially apply_async without that countdown, so if we comment it out, that is exactly what delay is. I also need a comma in there, to make sure the values inside args really are a tuple. So let's give it a shot: run it, and it takes some time to work through them; the worker shows the random tasks were received and then starts running them, though not necessarily in perfect order, mostly, but one actually went first and then zero, which all has to do with how the worker process distributes tasks to itself, since in this case it doesn't have more machines. So now we can run Celery and delay tasks, which is going to be really important when we start to schedule tasks, or run scraping when we want to and store the results in a Cassandra database. Unfortunately we can't just use Cassandra as it stands right now inside Celery; we have to do additional configuration specifically for that.
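The notebook experiment, roughly; it assumes the worker sketched above is running against Redis:

```python
# notebook: calling the task directly vs. through Celery
from app.worker import random_task

random_task("justin")                 # plain function call; Celery isn't involved

result = random_task.delay("justin")  # queued; the worker prints the message
print(result)                         # an AsyncResult holding the task id

# apply_async takes args as a tuple plus options like countdown (in seconds).
for i in range(10):
    random_task.apply_async(args=(f"justin {i}",), countdown=10)
```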
but now we can run celery and delay tasks later this is going to be really important when we actually start to schedule tasks or run scraping when we want to and then store it into a cassandra database unfortunately we can't just use cassandra as it stands right now inside of celery we have to do additional configuration specifically for that so now we're going to integrate celery with cassandra the way this is going to work is we're going to define a celery_on_startup method that will just collect all arguments that come in and for now i'll just print out hello world this we want to trigger every time celery starts now this is really useful for cassandra but you can do it for all sorts of things so if you need to run some sort of long task before a celery worker or a beat server starts you can do that right here so to actually wire this into celery we're going to import the signals for it from celery dot signals we import two different signals one is the beat_init signal the other one is the worker_process_init signal now the beat signal is for scheduling tasks we'll do that later but for now we're just going to focus on the worker process itself and we'll connect both of these signals to our startup method now if you look at this it's actually not a whole lot different from what we did with our fastapi application it has a function that we can call at any time and in there we're going to essentially do this it's going to take a few more steps to get there based off of cassandra's documentation but in our case let's start just like this so save it run our worker and what we'll see is a print statement for every single worker process inside of celery in this case it's 16 workers which is pretty nice and when we call a task it's going to pick a different one which we can verify by going into the celery basics notebook if we delay all of those different tasks we should see various workers coming through it's not always going to be the exact same worker now when it comes to cassandra it most likely will be the same worker because of how we want to set this up i don't need 15 different workers on the same persistent connection we just really need one connected but we definitely need to connect it so to do this we're going to import a couple of things from our configuration the database and our models and the initial thought would be oh maybe i do session equals db.get_session now that seems okay but unfortunately it doesn't connect the correct way so we need to redo this inside of that celery init method what we're going to do instead is get the cluster itself so instead of session we'll do cluster equals db.get_cluster and then declare our session equals cluster.connect and we want to register this with cassandra again so from cassandra dot cqlengine we import connection and then down here we do connection.register_connection with the string of the session passing session equals session and then connection.set_default_connection again with the string of the session or actually we don't need the session argument there just the name simple enough so this is very similar to what we did before except i didn't import those methods directly i just imported the connection module all together so it feels like we might be done here but not quite if we run it again close it down fully with ctrl c run it again oops i got a syntax error i need a comma after config let's try that again and it connects but it's connecting many many times
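Following the pattern in the cassandra driver docs, the startup hook at this stage looks roughly like the sketch below; db.get_cluster() is assumed to sit next to the get_session() helper used earlier:

```python
# app/worker.py (continued) -- wiring the startup hook to celery's signals
from celery.signals import beat_init, worker_process_init
from cassandra.cqlengine import connection

from . import db


def celery_on_startup(*args, **kwargs):
    # one cluster + session per worker process, registered with cqlengine
    cluster = db.get_cluster()
    session = cluster.connect()
    connection.register_connection(str(session), session=session)
    connection.set_default_connection(str(session))


# run the hook whenever a beat process or a worker process starts
beat_init.connect(celery_on_startup)
worker_process_init.connect(celery_on_startup)
```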
connected way too many times and too many connections is going to cause a problem in the long run so what we want to do instead is connect only one time which i will do but before that let's set up a task for our models so our models themselves are going to be the product model which is models.Product and then the product scrape event model which is models.ProductScrapeEvent okay and then we also want to sync the tables so from cassandra dot cqlengine dot management i'm going to import sync_table and sync the table for each one of these so sync_table of product and sync_table of product scrape event okay so we save that let's do a task here and i want to list products so what this means is i'll create a list of product.objects.all and then values_list of just the asins and we'll say flat equals true okay so we've got a new task that we want to try out overall this seems ready to go let's run our worker process again i'm going to close this down because i actually made changes to the tasks so whenever you do make changes you need to restart the worker looks like i forgot a parenthesis here let's make things a little easier on ourselves by laying things out a little cleaner okay there we go let's run that again and again it's connecting a bunch of times and if we go into our notebooks i'm going to duplicate this celery basics one and this time i'm going to call it celery and cassandra okay so the task that i want to import is this list_products task and then we'll call list_products itself i don't think i need the arguments in there it doesn't actually matter if i do and then we'll delay it as well and i'll delete these other cells i will actually restart this kernel and get rid of the arguments we do not need those at all okay so let's restart our celery worker and restart our kernel in here assuming that i saved everything restarting clear output there we go and let's run this so calling list_products directly i cannot actually list products this way it's not allowing me with this current connection now i'll explain why that is in a moment but right here we can delay it and what do you know it actually delays these tasks okay so one of the downsides to how it currently is is that it's certainly possible we already have a current connection so what we want to do is just be on the safe side and say if there is a current connection so if connection dot cluster is not none we're just going to shut that down and then again if the session is not none we'll shut that down too so this means i don't have 15 sessions open i just have probably one okay so let's restart this again and come back in here and i can run this as many times as i need and it will actually run the session and print out those products let's run it again and again looks like it's not picking up the recent one there we go cool so it is now actually delaying these tasks so your next question is well why didn't the direct call execute well that's because this jupyter notebook does not have a session of its own
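With those changes the hook and the first cassandra-backed task look roughly like this; the model names mirror the ones above, and putting the sync_table calls inside the hook is one reasonable reading of where they end up:

```python
# app/worker.py (continued) -- sync the models and shut down any existing
# connection so we only ever hold one session per worker process
from cassandra.cqlengine import connection
from cassandra.cqlengine.management import sync_table

from . import db, models

Product = models.Product
ProductScrapeEvent = models.ProductScrapeEvent


def celery_on_startup(*args, **kwargs):
    if connection.cluster is not None:
        connection.cluster.shutdown()
    if connection.session is not None:
        connection.session.shutdown()
    cluster = db.get_cluster()
    session = cluster.connect()
    connection.register_connection(str(session), session=session)
    connection.set_default_connection(str(session))
    sync_table(Product)
    sync_table(ProductScrapeEvent)


@celery_app.task
def list_products():
    # flat list of every asin stored in cassandra
    print(list(Product.objects.all().values_list("asin", flat=True)))
```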
so if i do from app import db and then call session equals db.get_session now i should be able to list things out it takes a moment but now it works and that's because of how get_session works and the fact that jupyter doesn't need to initialize the same way that celery does the other part is that celery is not going to run in the same process as jupyter or fastapi so jupyter doesn't need the startup hook because inside of the database module that get_session call is starting things up just like we did with the worker process now with celery we totally could have our task start the connection up but that's probably not that efficient especially running it over and over again on every individual task now if you're doing a ton of things a huge workload perhaps it will make sense to have it inside of the task i just don't know when that ever would be and this is also what the cassandra documentation recommends as well but at this point we now have it fully aligned it can work completely inside of celery and cassandra so what we need to do now is see how to do periodic tasks that is how do we actually schedule tasks to run maybe every 10 seconds how do we do that now we're going to set up periodic tasks so inside of our celery worker we're going to define a setup of periodic tasks it takes in a sender as an argument and potentially other args and keyword args sender is the most important thing to us and it's really simple we do sender dot add_periodic_task and then how often we want this thing to trigger let's say for instance every second and then we actually call the task we want to use now the easiest way to do this is to have the tasks in worker.py there are ways around that but having them in worker.py means that i can just call random_task like this and then we call it with a signature we do dot s and i'll say hello we'll take a look at that signature thing in just a moment but for now this is the method we just add the periodic task and then we need to wrap it into our app with celery_app dot on_after_configure dot connect okay so after everything's configured including setting up the celery application we now have these periodic tasks so let's give this a shot so every second now it's going to run the random task function with the word hello in it okay so now to run this it's just like what we've seen before so celery dash dash app app.worker.celery_app and we can call beat here and if i hit enter this will set up my beat process now i could do it this way and just run a separate process for my beat or i can use the worker itself and add dash dash beat this will do both of those things the worker and the beat so it's really up to you how you want to go about doing this i'm going to keep it on beat for a little bit and use this one but what we should see is every second or so after it's configured it should actually be printing out that weird thing we said right here but what we're not getting is the actual log information for it so just like with our worker we want to add the log level of info hit enter and now we should be able to see the printed statement and there it is right so it's actually scheduling it sending it off to run but the thing is this is just scheduling it that's what the beat
server does it only schedules these things it doesn't actually do the worker process so in order for us to do the worker process let me close down the explorer here i'm going to open up a new tab or actually let's split the tab by going back up here split it up and we'll do source bin activate and then we're going to go ahead and do celery dash app app.worker dot celery app and worker and then again the log level being info hit enter now it's actually going to consume all of these tasks it should have a number of them once it's all completely set up it will go ahead and start running these tasks right it's going to have a lot that are backed up which we can also verify by going into redis itself but here we go so here are those tasks and now they're actually executing so there is certainly a difference between the beat process and the worker process now they can go together right so again the command for them to go together is celery app app dot worker dot celery app and then the worker itself we've already seen that and then if we do paint there you go right there and then we can also add that log level in here of info and with our beat we can also add in a location for our scheduler so we can say call it celery beat dash schedule there are other ways to use schedulers there's other back-ends to actually handle the scheduling for this but i'm just going to keep it nice and simple and what you'll see in your local application is that schedule will be right there right it's actually showing you that schedule which is pretty cool so it's actually going to pick up from what was already in there by default and so this is actually a task that will run every second forever now let's go ahead and cancel this out i'm going to let the worker still run eventually we'll put them together like with this command but i'll let that worker still run and what i want to do is i want to only run this task maybe for a certain duration so i'm going to go ahead and say expires and we'll do 10. 
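Continuing the app/worker.py sketch, the periodic-task hookup described above comes down to roughly this (the one-second interval and 10-second expiry are the values used in the walkthrough):

```python
# continuing app/worker.py -- register a schedule once celery is configured
@celery_app.on_after_configure.connect
def setup_periodic_tasks(sender, *args, **kwargs):
    # run random_task("hello") every second; .s(...) builds a signature,
    # and expires=10 lets queued runs be dropped if they sit too long
    sender.add_periodic_task(1, random_task.s("hello"), expires=10)
```

The beat process is then started with celery --app app.worker.celery_app beat --loglevel info, or folded into the worker with worker --beat --loglevel info, as described above.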
okay so we're going to run this again and i'm going to let this run i'm going to clear this out with command k or we could always just reset it anyway and so i'll come back to it in just a moment but while i do i want to show you what the signature thing does so jumping into celery basics again i'm going to just really simply look at this random task here and we'll do it again and i'll actually call dot s on this and if i hit enter it's a slightly different result than what we've seen this is just creating a signature for what needs to be called right so i can actually call it just like that and so creating that signature is essentially letting the periodic test the you know the beat server tell celery the actual worker it's going to tell it hey you need to stop you need to stop running this this task right here okay and so we might already still have a lot of tasks going so i'm going to have to let that finish out before we can verify that it expires but the just the general idea here is it now has this the signature so i can use this signature too like this and i can pass this all around right so if i ever needed to call it again i would just call like that which is nice so that's just a really cool little feature for those signatures and that's how you want to make sure that you're using the scheduled tasks with those signatures but of course having expiration is nice i'm going to go ahead and change this to just abc just to see if that expiration is working so eventually we'll see abc over here instead of hello but we also want to have a another item here now i mentioned earlier that perhaps you want to run every monday morning right so what i'm going to do here is i'm going to import the schedule cron tab so we're going to go ahead and right above the signals here we're going to do from celery dot schedules i'm going to import cron tab okay so if you're familiar with cron it actually works a lot like that but even if you're not familiar with it i'm just going to give you a really simple example here okay so we've got our periodic task here now what i want to do is the schedule up front this is one second so of course you could do any number of seconds here for it to run right now what i want to do here is actually pass in the cron tab so cron tab cron tab and then we can do hour of i don't know 8 minutes being zero and then day of week is going to be let's say well let's say it's every tuesday it's gonna be the second day of the week so there you go that would actually run it every tuesday at 8 a.m okay of course you could do it at monday you can do it you can do it all sorts of places on that and if we look in the documentation do a search for crontab you'll see periodic tasks and crontab schedules and here is how it really explains all of the different things there's a lot of references on how you can actually go about running this and you can also go off of solar schedules you can change the time zone there's a lot of ways to do that and there's just other ways to also configure how these things run right so you can figure something like this and it's actually very similar to this but i think having it very explicit makes it a little bit easier to see what exactly is going on versus just making a big dictionary of things that are going to run and see how you can actually change the time zone in here as well so if you need a default time zone to win these things are going to run that's what you'll end up doing so the actual periodic task that i'll probably end up doing will be 
something like this where it is scrape_products and we'll actually run our scraping so for now i'll just print out doing scraping we'll build this out over time but for now i'm just going to implement the task for it i'm going to comment the other ones out and do sender dot add_periodic_task this again will be a crontab and in this case i'm going to use minute equal to star slash five and then our task of course is scrape_products with dot s creating that signature and so this should run our periodic task to scrape products every five minutes now it's actually going to be based off of these stored products still to some degree and i'll probably also implement a task for a single product so i'll call this simply scrape_asin and it will take in the asin string or something along those lines and then we print out what that is so print asin okay so inside of the scrape products task what i would do is print doing scraping and then for each asin in that queue that actual list we call scrape_asin dot delay so that it is still handed off and delayed to another worker process with that asin just like that and we can actually do all sorts of creative things in here as well like we've been seeing with apply_async we can pass those sorts of options in here too we'll play around with that when we actually get the scraping function working but for now that's what i want to do and i'm also not going to run these as two separate processes instead i'll have them as the one worker with beat which is why i made a reference to that command earlier that way i can continuously run this scheduler and based off of every five minutes run a web scraping event now this is just triggering the primary scraping task not the individual ones the individual ones we might want to run more frequently but if this starts to balloon like if we had a thousand products in here and i ran this every five minutes that means it's going to delay another thousand events every five minutes which may or may not be what you want to have happen it's going to start to be a lot of them and it's also potentially grounds for somewhere like amazon to block your ip address from even opening up their service so we do have to be careful about how we actually scrape the pages and how we handle certain requests so for now we're going to leave it at this and take a look at the actual working scraping process in the future now we're going to start the web scraping process using selenium and chromedriver now some of you that are familiar with python requests might be thinking oh can i just use this and then r.text generally speaking i have never been able to use python requests to actually scrape amazon.com i have to use selenium because it emulates an actual browser which we'll see in a moment so that's what we're going to do use selenium and chromedriver now of course if it's not installing on your system or you're having issues with it by all means for the scraping portion use google colab deepnote or really any kind of cloud notebook service so once you do get the scraping portion done you can use the endpoint from our fastapi along with ngrok to ingest that data that was actually why i wanted to show you ngrok and ingesting third-party data from anywhere
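Before moving on to the scraper itself, the scheduled tasks described above boil down to roughly this sketch, still inside app/worker.py; the Product query mirrors the list_products task from earlier and the tuesday crontab is the example from a moment ago:

```python
from celery.schedules import crontab


@celery_app.task
def scrape_asin(asin):
    # placeholder for a single-product scrape; the real work comes later
    print(asin)


@celery_app.task
def scrape_products():
    print("doing scraping")
    # hand every stored asin off to its own delayed worker task
    for asin in Product.objects.all().values_list("asin", flat=True):
        scrape_asin.delay(asin)


@celery_app.on_after_configure.connect
def setup_periodic_tasks(sender, *args, **kwargs):
    # trigger the primary scrape every five minutes
    sender.add_periodic_task(crontab(minute="*/5"), scrape_products.s())
    # or for something like "every tuesday at 8am" (day_of_week counts sunday as 0):
    # sender.add_periodic_task(crontab(hour=8, minute=0, day_of_week=2), scrape_products.s())
```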
so with that in mind let's start setting up our selenium and chromedriver code here so first and foremost i'm going to create a scraper client so inside of our app we're going to create scraper.py and i'm going to define a class called Scraper this class is going to have a method called get_driver and this is where we're going to return our selenium webdriver so there's two imports for this and that is from selenium we're going to import webdriver so of course we could just do webdriver.Chrome we could totally do something like that well the problem with this is it will actually open up a browser window every single time so what we want to do instead is say driver is equal to this of course so we still send this back but we also want to add in a few options specifically for chrome so from selenium.webdriver.chrome.options we import Options okay so now we want to declare several options for this chrome driver so first off we'll initialize the Options class and we'll pass it in to our driver here so one of the options is to add the argument headless that means it won't open up an additional web browser the other one is no-sandbox okay cool so that's getting our basic driver for chrome i'm going to keep it as an instance attribute and say driver is equal to none and we'll say if self.driver is none then we will initialize all this stuff and set it and then we'll just return self.driver okay so this also might want a type annotation here so i'm going to copy and paste something in just to show that this is of type WebDriver the webdriver class cool typing is a good idea now there's one more aspect i want to add in and that's from fake_useragent we're going to import UserAgent and i'm going to define a method called get_user_agent and this is going to return a UserAgent instance we'll say verify ssl being false and then i'll do dot random so the nice thing about a random user agent is that when i go to any web page it's going to see something different each time whether it's chrome or safari or firefox or mobile it's going to jump around to all of those so we're going to add that in as another argument user-agent set equal to that user agent and let's define get_user_agent up here simple enough so let's make sure that fake-useragent is in our requirements might not be it is not so add fake dash useragent and then we'll do pip install dash r requirements.txt just to make sure everything is installed cool so now we have our basic scraper client so let's actually implement this in our notebooks so jumping up into our notebooks we're going to create a new one and call it scrape with selenium basics and naturally we're going to want to cd into our project root which if you remember back we did pwd and then we just copy that into cd okay so now we'll do from app import scraper
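Put together, app/scraper.py at this stage might look roughly like the sketch below; the exact chrome option strings are the usual ones and should be treated as assumptions:

```python
# app/scraper.py -- first pass at the scraper client
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.webdriver import WebDriver


class Scraper:
    driver: WebDriver = None

    def get_user_agent(self):
        # a different browser user agent string on every call
        return UserAgent(verify_ssl=False).random

    def get_driver(self) -> WebDriver:
        if self.driver is None:
            options = Options()
            options.add_argument("--headless")    # don't pop open a browser window
            options.add_argument("--no-sandbox")
            options.add_argument(f"user-agent={self.get_user_agent()}")
            self.driver = webdriver.Chrome(options=options)
        return self.driver
```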
looks like we missed an r somewhere right there okay and restart run all there we go and here we go so our scraper itself is going to be let's call it our driver is scraper dot scraper dot get driver okay so that will do everything necessary for our chrome driver which we can see just like that so the other thing is if i wanted to get rid of this headless option here and then run this so kernel restart and run all what it's going to do is actually open up that web browser right if you're on a mac and it's asking for permissions to open this up to actually run chrome driver what you're going to need to do is go into your settings and allow those permissions so system preferences and you'll see something like security and privacy you would jump in here and you would see something to allow an app to open windows users i don't think you'll need to do that but it's certainly possible that you will in order to run the actual chrome driver right to actually run chrome driver on your system which is really just opening up chrome for us okay so now let's go ahead and open up this webpage here and i'll go ahead and say driver.git or let's declare the url first actually so url equals to that and then driver.get that url right so let's go ahead and leave that up like that and so this should get it now in my case i actually closed out the browser so i need to refresh everything and bring back headless and if you left the browser up you would have seen it change pages okay so now once the driver does get this url then we should be able to do driver.page source and hit enter and that's really all we need to worry about we are not going to be using selenium to extract the data we can use selenium but i think there's a better option for us and realistically if we were running scraping events we don't have to extract the data yet i could have saved and i can save this entire page source you can use some sort of block storage let's say for instance like aws s3 digitalocean spaces lenode object storage you know gcp block storage or gcp storage i think it is and so there's a lot of ways on how you could solve this you could save this actually as a file and parse and scrape later that is certainly a method that you might consider considering the fact that the art and science of scraping is well sometimes more art than it is science but i'm not going to do that that's just an option if you really want to be good at getting this scraping down come to think of it that would also allow us to start maybe building a actual machine learning algorithm to improve our scraping features as well which is definitely outside the context of what we're doing here but let me know if you want to see something like that anyway so now we have this page source right so i just want to show you real fast that if i try to do the same thing with python requests it's going to feel like you should be able to do it right so python requests would be r equals to requests.get that same url and then r.text to get the html data and we get this right here to discuss automated access to amazon data hit us up for our api cool um so the api is probably something well worth looking into but it's not something we're going to do at this point we're going to be continuing to learn about this so this is meant to be educational we're going to learn about it and actually scrape this data and then parse it but that means that i'm actually going to start building out my scraper a bit more now i absolutely could separate this out from the scraper class but 
this entire project is about doing this one thing so this one class right here we're going to build on top of it and kind of make this what we will use to actually perform the web scraping and we'll just use our jupiter notebooks to test out the things that we need to test now we're going to go ahead and execute javascript so we can have an endless scroll on any given web page so what that means is we're going to update our scraper class here and i'll go ahead and do from data classes we're going to import the data class decorator so it's a little bit easier to instantiate a scraper class first off we want the url this is going to be required next we want to say if the endless scroll is available so endless scroll and this is going to be of type boolean and we'll set it the default to being false and then we also probably want an endless scroll maybe time and we'll go ahead and put that as an integer set it equal to five okay so with that we're actually going to now implement a method just called get and take in self it's going to grab that url and also a driver so we'll go ahead and do driver equals to self.get driver and then driver.getself.url and then we want to say if self.inlist scroll then we're going to go ahead and perform in this scroll so what you can do in the driver is you can do something like driver.executescript right so this script tag is very similar to jumping into your web browser going to view developer view source and implementing things like this right so if we press up a little bit in my case you can see i can alert things i can also go to the scroll height of the document notice that these two numbers are different i can also scroll to the scroll height of the document so document dot body dot scroll height and this is executing javascript this is not that big of a deal especially if you know javascript really well but we can do it inside of here inside of this driver this allows for us to do something like this say current height is equal to driver dot execute script and we want to return document dot body dot scroll height right that seems simple enough now we're going to run a loop a while loop so while true we're going to go ahead and do driver dot execute script let's go ahead and open this up a little bit the script itself is going to be window dot scroll to zero that's the x axis and then the scroll height here okay this is the starting scroll height of course or the current scroll height and so then i want to do time dot sleep let it actually scroll to this endless scroll time so self.scroll time then we're going to go ahead and execute this again as our new height or itter height like the iteration height and then if the new height or the iteration height is equal to the current height then we're going to go ahead and break this loop which probably means that it's done otherwise we're going to set our current height equal to this iteration height and that's it and in this case i'm going to return driver.page source that's also it okay cool so let's give this a shot and go back into our jupyter notebook here we're going to go ahead and kernel restart and clear output and i'll go ahead and say s equals to scraper.scraper just like that our url is going to have to be this url and then our option for endless scroll will have to be true okay so now that we've got that oh let's go ahead make sure everything's imported and ran we can now run s.get and it should actually do the endless scroll for us i'm gonna get rid of these cells here run that it will take some 
time because it will definitely go through these iterations and it will sleep in those iterations in fact i made a mistake this should say import time and it probably even says that somewhere in here oh and we missed a parenthesis somewhere let's scroll down and make sure the window scroll to has a parenthesis right there okay let's try it again kernel restart run all it didn't even get to time dot sleep okay so now it's going to take some time to sleep and it's going to scroll to the very bottom now by all means you can test this by commenting out the headless argument that would absolutely open up the browser for you and you can see it scrolling down and all that and at the end of the day we will get this string back now that is one of the reasons to use the actual scrolling or the endless scroll and so i'm actually going to define this as perform scroll and put this in here just like this self.perform scroll okay that's a simple way to separate it out a little bit now i could also have it as a boolean value in here and perhaps that would be something i would want to do but i'll leave it in like that as a boolean value for whether we want to perform the scroll or not maybe call it perform endless scroll that might be a little better okay so let's kernel restart and run all again and it should work again okay so this is definitely not the only way to do an endless scroll one of the other ways is to just have a range of loops that you run through give it some time to wait once it's finished and continuously run that over and over again until some timeout happens or some range ends i just thought this was a clever method that i saw on stack overflow and figured hey that's really cool let's do that one oh yeah and we've got driver is not defined and this should be the self.get_driver so i'm actually going to pass in the driver here and in this case if the driver is none we're going to just return a big part of that is that we want to make sure we're only performing the endless scroll when there's absolutely a url that has already been accessed there's a lot of other checks we could do for that one as well i just wanted to make things as simple as possible for us
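To recap before moving on to parsing, the dataclass version of the scraper with the scroll logic described above looks roughly like this sketch (the scroll script strings are the standard document.body.scrollHeight / window.scrollTo calls mentioned above):

```python
# app/scraper.py -- sketch of the dataclass version with endless scroll
import time
from dataclasses import dataclass

from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.webdriver import WebDriver


@dataclass
class Scraper:
    url: str
    endless_scroll: bool = False
    endless_scroll_time: int = 5
    driver: WebDriver = None

    def get_driver(self) -> WebDriver:
        # same idea as before: headless chrome with a random user agent
        if self.driver is None:
            options = Options()
            options.add_argument("--headless")
            options.add_argument("--no-sandbox")
            options.add_argument(f"user-agent={UserAgent(verify_ssl=False).random}")
            self.driver = webdriver.Chrome(options=options)
        return self.driver

    def perform_endless_scroll(self, driver):
        if driver is None:
            return
        current_height = driver.execute_script("return document.body.scrollHeight")
        while True:
            # scroll to the bottom, wait, then see if the page grew any taller
            driver.execute_script(f"window.scrollTo(0, {current_height})")
            time.sleep(self.endless_scroll_time)
            iter_height = driver.execute_script("return document.body.scrollHeight")
            if iter_height == current_height:
                break
            current_height = iter_height

    def get(self):
        driver = self.get_driver()
        driver.get(self.url)
        if self.endless_scroll:
            self.perform_endless_scroll(driver)
        return driver.page_source
```

In the notebook that's then along the lines of scraper.Scraper(url=url, endless_scroll=True).get().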
cool so now it's actually time to parse out this data we need to parse our actual html string and we're going to be using requests-html to do this now it's not the only way to parse this data you could use beautiful soup and heck you could even use the selenium webdriver but i found requests-html to be one of the easiest ways to do it as we'll see in a moment so let's go into our notebooks here i'm going to duplicate this original notebook and this one will be scrape with selenium and parse with requests-html okay cool so let's hop in here i'm going to leave pretty much everything the same but change this to just being an html string so let's run everything get it all loaded up in there and then we'll do from requests underscore html we import the HTML class i'm going to get rid of that other requests import and now i'll say html_obj equals HTML and we pass in that html string a lot of html being said right now but this should give us all of that html as an object let's make sure we import everything okay and so this is why i like requests-html i can just do find and table so what this is going to do is find all of the html table elements inside of that string that i passed we can also do this for finding anchor tags and hrefs and all sorts of cool stuff with that as well so this is why i really like it another reason is if we go to the page and maybe grab the price element so if i go in here go to inspect and look a little bit deeper i see there's an id along the lines of price block our price so with this id if you are familiar with css at all or even javascript or jquery you can use the hash here and then the id and with that i can find that specific item i can also say first being true and then text which i think is just so cool but remember how i said there's an art and a science to all this that is what we want to discuss now what i just did will probably keep working for the price in the long run and if i also go to the title i can use something very similar with the product title id okay so going back in here this is going to be our price string and then title string is the html object find with that id first being true and text find will always return back a list of items unless we put first being true okay so we've got our title string and our price cool but remember the data we're trying to get so back into our models we want the asin the price string and the title so how do we get this asin okay so again we can be very uncreative about it and just command f for asin and you'll see something like product information this also gives a lot of other information about this product so if we inspect where the asin is maybe there's an id in there i actually don't see an id i see a class along the lines of prod det attribute value there's two classes in there of course so we could probably use both of them but i'm going to go off of the one that seems like it's about this particular cell and so i'll assume the asin is the html object find on that class first being true then text and then hey let's look at the asin wait a minute that's not the asin of course it's not it's this other element now i already knew this because i've well tested this out i wanted to practice here and this is where the art comes in like how do we actually find that value now if you know html well you will recognize hey that is a table this is the table body this is a table row each row has two columns one column is describing the value or it's the key the other one is the value of that key and if we keep going down this item we see that all across the board and all of this information might be information i want to keep so how do we go about doing that well let's get rid of this for now and say tables equals html object dot find table so this is just finding all of the tables inside of this entire page every single one of these table elements so we can print this out and what do you know there's a bunch of table objects in here and i may or may not know what's inside of these elements
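In the notebook the parsing shown above looks roughly like this; the two element ids are my best guess at the ones being inspected on the page, so treat them as assumptions:

```python
from requests_html import HTML

html_str = s.get()                       # page source from the scraper above
html_obj = HTML(html=html_str)

# css-style lookups; find() returns a list unless first=True
price_str = html_obj.find("#priceblock_ourprice", first=True).text
title_str = html_obj.find("#productTitle", first=True).text
tables = html_obj.find("table")          # every <table> element on the page
```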
and actually what's inside them doesn't matter to me let's iterate through them so for table in tables we're going to print out what the table is okay no big deal here now i can also do table.text and see what that is this isn't really that useful to me it's kind of jumbled up and sure i could probably write a python regular expression to extract all this data and do some crazy stuff but that's not intuitive instead what i want to do is iterate through each row that's in here so the first thing i need to do is check what the children are so table dot element dot getchildren now thinking in terms of html itself children are anything that's directly inside of the element it's not everything below it it's just this tbody right here it's not all the rows in there those are the children of tbody the only child of this particular table is tbody and we should be able to see that with this command right here and that shows me a bunch of elements that are tbody some tables have no children which means i don't care about them so now i can say for tbody in this right here now some web pages might have other things in there maybe they're not properly formatted but in the case of amazon it is so again i can print out what tbody is and there's a bunch of them now what you might try is something like for tr in tbody dot element dot getchildren and then print out the table row okay run that again and now it's saying html element has no attribute element okay so if we take a look at tbody again and print out the type of tbody versus the type of table it will stop after one iteration but we see the two types are different one is a requests-html element and the other is a plain lxml element which means i probably don't need to type out dot element on tbody that's it this comes with time and intuition and using this stuff so now if i print this out i've got a bunch of table rows here okay so inside of there we do for column in well it makes sense to me that it'd be the same as the tr okay let's print out what the column is and we'll keep the table row for a moment but now we should be seeing tds in here as in table data columns and also a table header so that's pretty cool so what if i grab the text for the column okay so if i print out text i'm getting a lot of stuff in here but i'm also getting a lot of nones there is another way to get it from an element and that is text_content and what that does is give us a little bit better of a look it should have pretty much all this data so what i'm going to do here is say the row is equal to an empty list and i'm just going to append column dot text_content to that list with the parentheses i knew of this method of course if you want to learn of the various methods you can print out what's in the dir and that would be a way to reverse engineer it yourself anyway so now we've got this and i'm going to print out that row itself and run this again and so now we are seeing a little bit better content overall it's still not great still not necessarily in the best spot but if we scroll down a bit we can search for asin and here is the data that i'm really looking for this is the content here is a key and here is a value a key value pair
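The nested loop being built up here comes out roughly like this (cleaning and de-duplicating the rows is the next step):

```python
for table in tables:
    for tbody in table.element.getchildren():   # direct children only
        for tr in tbody.getchildren():          # plain lxml elements from here down
            row = []
            for col in tr.getchildren():        # <th> / <td> cells
                row.append(col.text_content())  # text_content() instead of .text
            print(row)
```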
okay so i basically want to try and turn every row into that key value pair so basically in other words i want to say that if the length of the row is equal to 2 then i probably have a key value pair and that actually changes my data a lot now again this may or may not be true it's not always going to be true so instead of checking that it is equal to 2 i'm just going to say if it's not equal to 2 then continue otherwise now that i've got this i can make an assumption about the rows themselves get the first value and get the second value and then print out the key value pair okay it's kind of hard to tell like this so let's add in a dictionary here so this is going to be my row values and we'll do data equals empty then data of key is equal to value and values dot append data actually let's call this dataset instead and we'll print this out in fact the dataset itself can probably live outside of the row loop it doesn't really need to be in there so there we go we print out a dataset here and it's starting to look a little bit better it's kind of hard to tell based off of how it currently is because i printed out all of that data let's try that again cool so it's definitely getting closer the content though is definitely a challenge this is not great content we need to strip a lot of it out so what i want to do then is give an empty string for the content and try to grab that content again so this time it's going to be content equals this inside a try and except pass and i will only be appending this content if it's not equal to an empty string right so that means i can actually extract the content right there and i can also clean this up so i'll use underscore content to just signify that i'm cleaning it and it's going to be whatever this content is dot strip so that's what i'll append now and we can replace items in here as well but now hopefully i'll see a little bit better content and i do it is showing this a little bit better cool so there is another package that i actually ended up installing so if we go into requirements.txt we've got this python-slugify package now the reason for that is my keys so we're going to come back to the top here and do from slugify import slugify scroll down a bit and now for my key i'll say key equals slugify of the original key and so the key is now a slug value which is a lot better for a couple of things as an actual key in a python dictionary but also when we try to send this data to our backend okay looks like something still needs to be defined so let's run that and run this again and cool so now we have this data so my assumption now is that a given key will not repeat so let's change this dataset from a list to being a dictionary itself and now what i want to do is check if the key is in the dataset and if it is i'll just continue otherwise dataset of key is equal to that value no longer appending it but now making one big dictionary let's actually pretty print this i'm going to import pprint and we'll do pprint dot pprint i thought i ran that i did not let's run that cell and what do you know it's
much easier to see we have an asn there we've got a brand we've got a color we've got a country of origin we have some stuff about a customer reviews you know we've got a lot of cool stuff in here list price it's showing the list price which is nice but we already did actually find a price so we could probably use this price to you know verify the other price that we found which maybe isn't the same in this case it's not and we might find other prices that look at there's another price right so we have all sorts of things going on here this other price of course has a lot of data we just simply don't need which there is a way to clean that up and we will but the idea here is we have all of this data now that we've been wanting to actually extract but more critically the most critical piece is the asn here as well as uh the actual product title and our price like quite literally copying what we've got in our data model here in both places right so that is the bare minimum of what we would need to track so now it's actually time to implement this inside of our actual scraper itself now this is a really good reason as to why you'd want to use jupiter i had to jump back and forth and find things that i'm really interested in i will say that i know for sure this tables method does not work on every product on amazon.com it certainly works on like electronic items and like physical goods a lot of the digital goods don't necessarily have this same structure but notice that the table itself it's not about the classes it's not about the ids it's just a table it's just an html table and then it's looking for things that have key value pairs which are really just kind of two rows like this right or two columns rather with a bunch of different rows as the titles and then a bunch of different columns as the values so one other thing that we could think about is actually instead of looking for just tables but looking for the th like the table header and the table column that would be another way to do it right i'm not going to do that but i will say that implementing something like this is i think a preferred method to these so like product title and our price block our price these two are really vulnerable right they're vulnerable in the sense that you know what if amazon just decides that they want to call this product dash title now i'm kind of in a spot where maybe my scraper is broken and i don't necessarily know about it for a while whereas this table here this is not broken this is going to work going forward granted the data itself can be cleaned up and also especially for like the price data there's probably a way to extract the actual price itself versus trying to like you know clean everything else in there so yeah i mean i really like this method of actually just finding a table and iterating through it to extract the data we want now when it comes to all of this other data this is definitely something that you could also consider trying to scrape like maybe the customer reviews right maybe you want to track what the customer reviews are over time now that adds another layer of complexity that we're not going to cover but you could essentially use the same methodology here for these reviews because there is going to be some sort of structure to them will it be the same structure as the tables probably not right so if we look in here it already doesn't look like it's a table looks like it's all sorts of other things right and so this is the art and science of actually starting to scrape data 
and parsing it parsing it as the the science problem not so much so now we're going to go ahead and prepare this for our scrape client in other words i want to be able to add a method right on my scraper to handle all of this stuff so the first thing is to define extract data set and all i want to do here is get everything that's going on with this so i'm going to go ahead and cut this out here including the data set and paste this in here okay so we want to actually put in the table so we're going to actually run the tables in here and so this is going to return our data set and now we're going to extract data set on these tables okay so we need to make sure that our tables are defined i might need to re-run everything let me just do that real fast and then of course we will at the last of it we'll go ahead and call this our data set equal into that and then we'll pretty print it assuming that everything's run correctly okay so give that a moment to scrape there's our data set and it prints it out okay great so now i have the ability to do that which also means that i want something like define extract tables and in this case i'm going to go ahead and pass in that html object and it's going to be simple just going to find those tables and we'll go ahead and leave self out for a moment and i'll push this up one and then we'll just call this extract tables and then these two things are essentially the same so what i want to do here is we'll just say define extract element and value or element text that's probably the best so the element id this time it's going to go ahead and look for the element so l equals to something like this right and so this is going to have to be the element id then we'll go ahead and say if not element then we'll return an empty string otherwise we'll turn return element.text okay so i'm going to push this down a little bit or rather keep it up and then in here we'll just call this i should actually push in the html object as well there we go and so that should be simple enough there's a way to do the price string and then the same exact thing for this part cool okay so now we have several methods that we can implement on our client but there is one more thing that i want to do and that's inside of my data sets right so inside of the data sets if we take a look oops we left out the html object here so inside of the data set i have the ability to well first of all i see the key and then i have a value so in here i can actually update this a little bit more and so in the data set itself what i want to do is i want to see if a dollar sign is in the value and then if it is then i'm going to go ahead and say my new key is equal to well the original key my old key is equal to the string of key and raw okay and my new value well i'll go ahead and keep it as value and then old value is also the uh it's going to be a new value in just a moment so the idea here is to iterate through this and check if these values are there and then update our data set accordingly okay so i'm actually going to put this down here tab this over and say else right so the dollar sign is in that new value then we're going to go ahead and say data set and new key is equal to our new value and data set old key is equal to our old value so the idea is our new value we need to actually extract some data here so that extraction is going to be a regular expression and i'm just going to give it to you because explaining how this regular expression works is probably not worth it in this case this will 
extract a dollar amount so dollar signs something will look for that dollar amount and be able to extract it and the only reason i can do this is because there is a dollar sign in there granted it doesn't work for all forms of currency just a us dollar so let's go ahead and import a regular expression now okay and then come down here run the data set again or actually call it on this new key just passing in that value we extract it again and this time oh i didn't actually set it up so let's run that and now what we should see we've got a list price there it is let's price raw still has all that other stuff in there if i ever wanted to clean it in the future and there's our price so we have a new price that's showing up that is similar to our price string potentially the same potentially not so the price string is going to be just slightly different but now that i have all this different data and all these different methods it's time to actually bring it over to our client okay so first and foremost we're gonna go ahead and bring over this extract element text item and we'll come in here into our scraper and i'm gonna go ahead and just put it right here okay so we've got this html object that we're going to have to pass in here which you know we don't actually perform anywhere we don't put anywhere so let's go ahead and define get html object and the goal here is to convert whatever this git call is and turn it into the html object that we have here right so we've got that html string and then that html object okay so let's go ahead and bring all of these imports in here with the exception of the pretty print function we don't need that let's go all the way to the top there it is okay so now in here we're going to go ahead and say html string equals to self.git and then the html object is this right here and then we can return that html object okay so the thing about this is i can actually also set this as a value on my scraper class so html object is going to be equal to the html class and we'll set it as none and so really if self.html object is none then we're going to go ahead and grab it and set it right there and then return it okay so really i want to come down here and call scrape okay or perform but i'll leave it in a script and in this case well what is it that i'm trying to scrape well it's all of these different elements here so it's this price string so let's go ahead and bring this one in and we've got oh we already have it in there and so in here this should be our html object equals to self.git html object okay which also means that i probably want to run this first in my scraping method here all right which also means that you know perhaps i can combine these two i'm not going to do that right now because i want to leave this api intact but we're going to do that scrape html object next is getting each element in there the first one being our price string here this time i actually do not need to include the html object in the request or the actual method call i would just do self dot and get this okay and same thing is true for the title so go ahead and do title and get that title element id there okay so the next thing is extracting our tables so again we're going to go ahead and come in here and this time it's a little bit more simple and just pass it in itself so again i'll go ahead and do tables equals to self.extract tables and then finally we should have this string here this is actually a valuable utility method so i'm going to actually keep it out of the class 
itself since i'll probably use it somewhere else too and make sure the regular expressions module is imported as well okay and then the dataset method here i think we should call it extract table dataset because if we ever find a list dataset then perhaps we'll add that one separately so the dataset equals self dot extract_table_dataset with the tables and then we're going to return some sort of dictionary so we'll put the price string in there the title string in there and then we'll unpack the dataset itself which should be a dictionary okay so we should see a dictionary coming back and we can add a type hint in here for that to return a dict okay so let's give this a shot now i'm going to duplicate this original notebook and rename it to simply scrape client then we'll jump into that notebook okay so we have this get here but now we should just be able to do dataset equals all of that with simply scrape at least hopefully that simply and i'll delete all of these other cells i no longer need them let's give this a shot okay so there is a possibility that i need to update something related to the endless scroll and that is the nice thing about the endless scroll it will take some time prior to actually extracting the html itself so let me just scroll up to that in the code so in this endless scroll if i call it every time it's going to continuously scroll and sleep so it's for sure almost always going to run for at least five seconds before we ever actually get the data and the idea for me is if the endless scroll is false then i want to do time dot sleep instead so if self dot endless_scroll we scroll otherwise we do time dot sleep for some arbitrary number i'm just going to leave it in as 10 seconds that way for sure it's going to wait for the entire page to load and then it's going to return back the page source there okay oh and we missed a required argument on our dataset call by not putting the tables there okay let's try that again so kernel restart and run all right so here's our dataset there we go so one of the key things about doing web scraping is to not overload anyone else's service so i really wanted to have these sleep times as well as the endless scroll mainly so that i'm not overloading amazon.com now it's kind of an oxymoron to say that i would overload amazon.com but the idea is if you are starting to do unnatural scraping amazon or any service will probably stop you from doing that they'll start to block the requests that you're giving because you're not being a good web scraping citizen so in that way we want to have a sleep timer of 10 seconds that's actually not that much time especially if we're going to be continuously doing this like every minute or so we're going to do another request for this exact same product that's kind of the idea here okay cool so now what we want to do is just do this one more time but this time going based off of an actual asin so i'm going to copy this notebook again we'll duplicate it and i'm going to come in here and rename it scrape client via asin
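Put together on the class, the pieces from the last few steps might look like the sketch below; it continues the dataclass from earlier (with an extra html_obj field defaulting to None), and the element ids, the "-raw" key suffix, and the exact dollar regex are all assumptions standing in for the ones used in the video:

```python
import re

from requests_html import HTML
from slugify import slugify


def extract_price_from_string(value, regex=r"\$[0-9,]+(?:\.[0-9]{2})?"):
    # pull the first us-dollar amount out of a messy string
    matches = re.findall(regex, value)
    return matches[0] if matches else None


class Scraper:  # continuing the dataclass sketched earlier, plus html_obj = None
    def get_html_obj(self):
        if self.html_obj is None:
            html_str = self.get()
            self.html_obj = HTML(html=html_str)
        return self.html_obj

    def extract_element_text(self, element_id):
        el = self.get_html_obj().find(element_id, first=True)
        return el.text if el else ""

    def extract_tables(self):
        return self.get_html_obj().find("table")

    def extract_table_dataset(self, tables):
        dataset = {}
        for table in tables:
            for tbody in table.element.getchildren():
                for tr in tbody.getchildren():
                    row = [col.text_content().strip() for col in tr.getchildren()]
                    if len(row) != 2:
                        continue
                    key, value = slugify(row[0]), row[1]
                    if key in dataset:
                        continue
                    if "$" in value:
                        # keep the raw string and the extracted dollar amount
                        dataset[f"{key}-raw"] = value
                        dataset[key] = extract_price_from_string(value)
                    else:
                        dataset[key] = value
        return dataset

    def scrape(self) -> dict:
        self.get_html_obj()
        price_str = self.extract_element_text("#priceblock_ourprice")
        title_str = self.extract_element_text("#productTitle")
        dataset = self.extract_table_dataset(self.extract_tables())
        return {"price_str": price_str, "title_str": title_str, **dataset}
```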
Okay, cool. So now what we want to do is run this one more time, but this time based off of an actual ASIN. I'm going to copy this notebook again, duplicate it, and rename it; this time we're going to call it scrape client via ASIN. The ASIN itself is always in the URL for the product detail view, and this is it right here; we can verify that by scrolling down, and there it is. A quick way to get to any product URL is to take amazon.com, then /dp/, and then the ASIN. Now, this same URL setup is not the same everywhere in the world; every version of Amazon doesn't necessarily look like this (I've seen the Great Britain one as well, I think it's gb), but in the US this is what it looks like. So I'll go ahead and grab the ASIN here, comment out this URL, save it, and we'll give it a shot. Yet again, it'll take probably 15 seconds or so, maybe a little longer, to fully scrape this, but this time it should be based off of the ASIN and it should bring back the exact same data. Granted, we could also print out that URL to see what it is and actually go to it; if we went to that URL we could verify that it is the same product just by looking at it, and it is, I know for sure it is. But anyway, that's pretty cool: we now have the foundation for our scraping client to run as a periodic task and also to store the results, and we want to do both of those things in the next one.

[Music]

Now we're going to go ahead and put it all together. If you remember back to our worker, we actually had a queryset that gave me all of the stored ASINs I have in my Cassandra database, and from there it called an individual scrape for each one. We could almost use our Scraper class, but we're missing something, and that is the ASIN being in the constructor. So I'm going to set the asin to None as well as the url to None, and in here I'm now going to define a post-init method, which is two underscores, post, underscore, init, two underscores, taking self. Then we say: if self.asin, set self.url to the f-string https://www.amazon.com/dp/ followed by self.asin and a trailing slash; and then if we still don't end up with a URL, we raise an exception that says "asin or url is required".
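Building on the sketch above, the constructor change might look like the following; the guard condition and the plain Exception are my reading of the narration, not the course's verbatim code.

```python
# Hedged sketch of the __post_init__ change described above: the Scraper can be
# built from either an ASIN or a full URL. The guard at the end is an
# interpretation of the narration.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scraper:
    asin: Optional[str] = None
    url: Optional[str] = None
    endless_scroll: bool = False

    def __post_init__(self):
        if self.asin:
            # derive the product detail URL from the ASIN (US store layout)
            self.url = f"https://www.amazon.com/dp/{self.asin}/"
        if not self.url:
            raise Exception("asin or url is required")
```

Dataclasses call __post_init__ automatically right after the generated __init__, which is why it is a convenient place to derive the URL and validate the inputs.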
Simple enough. Let's do a quick test on this one: go back into our scrape client notebook, and this time, instead of instantiating it with a URL, we'll do it with an ASIN, then Restart and Run All, assuming that I actually saved it. Worst case, we could also print out what the actual URL is on the scraper. It is taking some time, so I think it's actually working, and there we go. That's exactly what we want, because now, inside of my worker, I should be able to actually scrape this. So let's bring it in: import the Scraper, scroll down a bit, and we've got our scrape base in here. This is going to be s = Scraper(asin=asin), and for the endless scroll we'll just say True; this of course allows us to grab our dataset, just like that.

Now we just need to validate said dataset, so we grab the schema, and in our case we want the same schema the view would use. Going back into main, we want the same one as here, just to add some consistency, and then also the crud event as well. So we come into our worker again, bring in our schema, come down to the product list schema, and set validated data (or clean data) equal to schema.ProductListSchema with that dataset, which of course is a dictionary. There we go, we've got our validated data, and from there, much like we did in main, we can just send it on down. Don't you worry, we will absolutely implement this elsewhere as well.

Okay, so now we should have everything working, assuming I imported crud, which I did not. Now, if you were like me, you would have had your worker running that whole time; my worker was running and attempting to do a lot of these things, so I'm going to close it down, let it shut down naturally on its own, and then run this. One thing about this is that some of these ASINs are incorrect, so they're going to come back as page not found; perhaps it would be better if I deleted all of my product objects first and then found some real ones to start with. To really improve everything, I would add another piece to my scraper to find new ASINs on the page, but that's just adding complexity we don't necessarily need. And it looks like we have an import error: scrape is supposed to be scraper, no big deal there.

So before I run that again, let's actually implement this in the notebook. I'm going to duplicate this notebook and rename it scrape via worker task. There it is. So now: from app import worker, and our worker task is just simply scrape_asin, so worker.scrape_asin, there we go. Initially we'll leave it like that, and then we'll have it going with Celery; the first step is to see if it even works outside of Celery. Something I didn't do that I probably should have is initialize the database session as well, and I think I already hit that error: yep, we got an engine exception there, it couldn't do it. Okay, so we create that session with session = db.get_session(), and now I should be able to call the worker task in the Jupyter notebook and get a valid response. Hopefully we'll see, and it appears that we have.
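As a sketch, the worker task at this point might look like the following, assuming the project's app package layout (app.worker, app.schema, app.crud, app.db) from earlier in the series; the names ProductListSchema and add_scrape_event follow the narration, but their exact signatures, and the use of Celery's shared_task decorator, are assumptions.

```python
# app/worker.py (sketch): scrape one ASIN, validate the raw dict with pydantic,
# then hand the clean data to the crud layer. Helper names follow the
# transcript; exact signatures are assumptions.
from celery import shared_task

from . import crud, schema
from .scraper import Scraper


@shared_task
def scrape_asin(asin: str):
    scraper = Scraper(asin=asin, endless_scroll=True)
    dataset = scraper.get_dataset()
    validated_data = schema.ProductListSchema(**dataset).dict()
    crud.add_scrape_event(validated_data)
```

And the notebook test described above would be roughly:

```python
# scrape via worker task (notebook sketch): run the task outside Celery first.
from app import db, worker

session = db.get_session()      # cqlengine needs a live session before model access
worker.scrape_asin("<ASIN>")    # placeholder ASIN; use one of your stored products
```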
Okay, so I didn't actually return anything. I suppose I could return the scrape event itself or print out the product itself, so, like we do in the view, I'm just going to print out what that product is, and I won't return anything since this is mostly going to run in the background; we'll go ahead and print the ASIN and that product, passing the ASIN in just like that. I'll give it another shot, Restart and Run All, and while that is running I'll start up my worker and my beat process again. Looks like we have, oh, a silly error: I didn't put an equals sign. Okay, so there that goes, Restart and Run All.

What's going to happen is that in some cases it will not find a dataset, thus it will not be validated, and so on, so we probably still need to improve this a little bit. In this case it worked fine, but what we would probably want in our worker is this: where I have validated data, let's put it into a try block, with an except that sets validated data to None, and then say if validated data is not None, we go through with actually adding it into our database. So let's restart the actual Celery worker process again and just let it run in the background. The way we had it set up, it's going to run every five minutes or so, so it's going to take a little bit before it actually executes, but that's why I have the scrape worker task in the Jupyter notebook: I can Restart and Run All here, and if I do have a worker process that I want to test out, let's say we grab this task, give it some arbitrary ASIN, and call delay. That should work, and so should calling it without delay; both of them just might not actually scrape anything. The first one gives you a synchronous result, and the delayed one should start scraping as soon as the worker receives it. Right, task received and it was done; it returned nothing back, which is why it says None, because the actual method itself returns nothing. In this case it returned nothing, but in some cases it will return something else. So I think that's pretty cool. We can also do the same thing here, just delay it, and it'll send it to the background and run it there. So I'll actually return the ASIN and maybe True here, and the ASIN and False if it failed, and that way when it does run we'll see that. Okay, so this is now running, and it's running correctly for ASINs that even exist, and now I'm actually starting to get a big old dataset.

[Music]

Thanks so much for watching, and hopefully you got a lot out of this one. As you might imagine, if we continue down the path of scraping all this data, maybe doing thousands of products, we can start to see how our dataset is going to balloon to be massive, and it's going to track all of those price changes. Maybe you take action on that, or not; you can also just look for trends or historical patterns and really get into the amazing things you can do with this big data. And of course the Cassandra database service AstraDB is absolutely going to be there and be able to handle it all, and I think it's actually really easy to implement once you get past some of the early configuration hurdles you might have faced, and even those weren't that big of a deal. I do intend to cover a lot more about Cassandra and AstraDB, so let me know your comments and suggestions for future projects below, be sure to subscribe, and I look forward to seeing you next time. Thanks again.

[Music]
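To recap the refinements from this last stretch, here is a hedged sketch of the defensive version of the task, with the try block around validation and the ASIN-plus-flag return value described above; as before, the module layout and helper names follow the narration, and the ASIN in the usage lines is a placeholder.

```python
# Defensive version sketched above: a missing or invalid dataset no longer
# crashes the periodic task, and the task reports the ASIN plus a success flag.
from celery import shared_task

from . import crud, schema
from .scraper import Scraper


@shared_task
def scrape_asin(asin: str):
    scraper = Scraper(asin=asin, endless_scroll=True)
    dataset = scraper.get_dataset()
    try:
        validated_data = schema.ProductListSchema(**dataset).dict()
    except Exception:
        validated_data = None
    if validated_data is not None:
        crud.add_scrape_event(validated_data)
        return asin, True
    return asin, False
```

```python
# Notebook usage: call the task synchronously, or queue it for the Celery worker.
from app import worker

worker.scrape_asin("<ASIN>")        # runs right here and blocks until done
worker.scrape_asin.delay("<ASIN>")  # queued; the worker process picks it up
```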
Info
Channel: CodingEntrepreneurs
Views: 5,401
Rating: 4.9835391 out of 5
Keywords: install django with pip, virtualenv, web application development, installing django on mac, pip, django, beginners tutorial, install python, python3.8, python django, web frameworks, windows python, mac python, virtual environments, beginner python, python tutorial, djangocfe2021, python, django3.2, web apps, modern software development, web scraping, cassandra, nosql, astradb, selenium, celery, jupyter
Id: NyDT3KkscSk
Length: 218min 5sec (13085 seconds)
Published: Thu Sep 23 2021