Read Giant Datasets Fast - 3 Tips For Better Data Science Skills

Video Statistics and Information

Captions
You found a giant dataset, you try to load it with Python, and your computer gets stuck. But it doesn't mean you need better hardware; it only means you need to improve your code. Here's how. [Music]

So on the menu we have a CSV file with almost 3 million rows that takes up 11 gigabytes of space. It's a very big file, and today I will show you some clever techniques for reducing its processing time. We will compare its performance across two different systems: the first one is my new-gen, super powerful PC, and the other one is my poor old laptop, which is barely even operational. So if my laptop can handle it, so can your computer.

You can of course follow along using your own dataset, regardless of its shape and size, but if you're curious to see how professional data platforms work, let me show you how to get a sample of the same data that I'll be using, for free. I got my enormous dataset from Brightdata.com, and yes, it's the same dataset that you guys voted for in a recent community post. Bright Data is a powerful web data platform that just happens to provide gigantic ready-to-use datasets, and of course all the means of controlling them as well.

To get our sample we will log in and navigate to our user dashboard, then click on the datasets and web scraper tab, followed by a click on the dataset marketplace, where we will search for Amazon. And boom, here is our lovely Amazon bestseller products dataset. It currently contains 2.3 million records, but this keeps changing all the time; these kinds of datasets are constantly updated, so if anything changes on Amazon, it changes here as well. Now, if you think 2.3 million records is a lot, I wonder how you feel about 279 million of them. That's a bit more intense.

Okay, now let's quickly click on "View dataset". We can of course download an arbitrary sample of data, either in JSON or in CSV format, but why on Earth would we do that if we can filter and customize our data first? For this we will click on the "Create a custom subset" button and call this subset "filtered". Let's say that we're only interested in products where the product category includes Electronics, and in addition we would like those products to have a value inside their image URL field, so image_url exists: true. You can of course keep customizing it further; I'm just going to click on "Create subset". And beautiful, we are now dealing with a completely different set of values: instead of 2.3 million we now have 65,000 of them. That's exactly why we always filter our data before downloading it, especially if you're planning on purchasing it; there is a very big gap between 2.3 million and 65,000.

Okay, let's quickly click on "Download sample" and we will get our fine-tuned, beautiful sample in CSV format. Let's click it, and yay, here's our lovely CSV file. On my end, though, I'll be working with the complete and unfiltered version of this dataset, just to showcase all the complexity around it. So let's quickly double check that this file is indeed 11 gigabytes in size. Wow. Now let's see what we can do about it.
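If you prefer to check the file size from Python rather than from the file explorer, a minimal sketch could look like this; it assumes the file is saved as data.csv (the name used later in the video) right next to the notebook:

```python
import os

# Rough size check for the downloaded CSV, converted from bytes to gigabytes.
# "data.csv" is the file name used in the video; adjust it to your own sample.
size_gb = os.path.getsize("data.csv") / (1024 ** 3)
print(f"data.csv is roughly {size_gb:.2f} GB")
```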
Once we have our dataset, we will navigate to Anaconda, where we will open Jupyter Notebook in the exact same directory as our file. If you're not sure how to do this, please check out my Anaconda guide for beginners. Once we are inside Jupyter, we will create a brand new Python 3 notebook, call it "Amazon Data", and as usual we will begin with the imports.

Since data is what we seek, we will first import pandas as pd, which will help us with reading, organizing and searching within our dataset. In addition we will import the time module, which will help us measure just how long each operation takes. The first thing we'll do is load our CSV file with pd.read_csv, to which we will pass the name and path of our file, in my case data.csv, and we will assign this expression to data. Since we'd like to measure how long this operation takes, right above it we will create another variable called start and assign it to time.time() with an empty set of round brackets, which represents the current time. Then, at the very end of this cell, we will add a print statement that says "file loaded, this operation took", to which we will concatenate the current time.time() minus the start time, and we will also mention that what we really mean is seconds. In the last print statement we will verify that the size of our data matches our expectations; for this we will print the words "data shape" alongside data.shape. Okay, makes sense. Let's quickly give it a run (with Shift+Enter, by the way), and when I say quickly, what I actually mean is not so quickly, as you will see shortly. Eventually, after 99 seconds, we are finally able to load our file, despite this warning. The reason we know we loaded the file is that the shape of our data is 2.3 million rows by 39 columns, which is what we expect. I wish I could say the same about my laptop, because when I run the exact same code, my notebook dies and I am unable to load this file at all. So what are we supposed to do?

Well, let's start with solution number one: focus on relevant data. My question to you is, do we really need all those 39 columns? Probably not, but it depends on the nature of your project. Let's say that in our case we are making an app that simplifies shopping on Amazon. When I imagine it, I imagine a product title, an image, a price, as well as a link to the original listing on Amazon. So let's try to implement it. In order to make it happen, we will need to take a look at the titles of our columns; for this we will navigate to the next cell, type data.columns, and give it a quick run. Now let's pick and choose which columns we need and which we don't. First things first, we will definitely need the final price. We will also need the image URL, the title and the URL, and let's also include the product category, because it might be important, especially if we'd like to expand on our app in the future.

In order to focus on those columns, we will go back to the very first cell and add a usecols property to our read_csv method. We will assign this property to a list with all the column names we are interested in: we will begin with final_price, we will add image_url, then the url of our listing, as well as the title, and lastly we will also include the category, whose column is written as categories. To make everything a bit more readable, we will split this line of code into several lines. There you go. Before we run this cell, let's make a quick mental note that it previously took us 99 seconds; I wonder if we can improve on that. Let's quickly give it a run, and beautiful, we are now loading this file almost twice as fast. Please keep in mind that these results are approximate; if we'd like to get accurate results, we would need to warm up our CPU first. As for my laptop, it was finally able to load this file in a record time of 5 minutes, which is something, but it's still not good enough. There must be a better way.
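Put together, the timed loading cell from this part of the video looks roughly like the sketch below. The file name data.csv comes from the narration, and the column names are the ones read out loud, so treat them as assumptions and check data.columns against your own subset first.

```python
import time
import pandas as pd

# Column names as read out in the video; verify them with data.columns on your own file.
wanted_columns = ["final_price", "image_url", "url", "title", "categories"]

start = time.time()
data = pd.read_csv("data.csv", usecols=wanted_columns)

print("File loaded. This operation took", time.time() - start, "seconds")
print("data shape:", data.shape)
```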
This leads me to solution number two, known as chunking. With chunking, instead of loading our 11 gigabytes of data all at once, we first split it into small chunks and then load them bit by bit; you may know this process as batching. For this we will add a new property inside read_csv known as chunksize. On your end I recommend setting it to 10, since you're dealing with a smaller subset, but on my end I'm going to set it to 50,000. Since we are splitting the original DataFrame, we are no longer dealing with a DataFrame object but with something called a TextFileReader; that's why the attributes and methods associated with DataFrames are no longer relevant to our data variable. So let's go ahead and comment out our shape print statement and give this cell another run. And wow, this operation took less than one second, holy smokes! It's the first time I can say the same about my laptop, which shows very similar results. Yay!

In order to access each of our chunks, we will need a for loop: for index, chunk in enumerate(data), where enumerate allows us to iterate both over the chunk and over its position in the sequence, represented by index. If the index of our chunk is equal to zero (==), we will go ahead and print this chunk; otherwise we will need an else clause where we break out of the loop, because we don't really want to iterate over all our chunks, we just want to have a nice peek inside. Let's quickly run this cell. Awesome, this is how each of our chunks looks; we can see all our fields, final_price, image_url, and so on.

Now, the TextFileReader object itself comes as is; we cannot index into it or modify it directly the way we would a DataFrame. The way to bypass this, if you do need to inspect or modify something within your chunks, is to convert the chunk back into a DataFrame. We will do this with pd.DataFrame, to which we will pass our chunk, and assign the result to df. Then we can try to print df at the field final_price and the index of zero. Let's give it a run, and beautiful, we are now accessing one of our fields. If we want to reassign it, let's copy the contents of our print statement and assign it to 8.88, for example. Let's give it a quick run, and perfect, we are now modifying the values of our chunks. Excellent.

Now, the last solution we will explore is saving our modified data into a new CSV file. For this we will remove the chunksize property and the for loop, and we will uncomment our print statement; we are then ready to convert our DataFrame into a CSV file. For this we will type data.to_csv, to which we will pass the name of our not-yet-existing file (on my end I will call it modified_data.csv), and then I will set its encoding to utf-8, which means that the format of our characters will be consistent. Let's also set the index to False, otherwise everything will look extra messy. Cool, let's run this cell, and after 60 seconds of waiting we now have a brand new CSV file inside our project folder, and this time we are dealing with 600 megabytes. Wow. If it sounds too good to be true, I agree, so let's double check that we didn't mess anything up.
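A minimal sketch of the chunked version of that cell, under the same assumptions about the file and column names, might look like this (chunksize=50_000 matches the video; something like 10 is enough for a small sample):

```python
import pandas as pd

wanted_columns = ["final_price", "image_url", "url", "title", "categories"]

# With chunksize set, read_csv returns a TextFileReader that yields DataFrames piece by piece.
chunks = pd.read_csv("data.csv", usecols=wanted_columns, chunksize=50_000)

for index, chunk in enumerate(chunks):
    if index == 0:
        df = pd.DataFrame(chunk)         # wrap the first chunk as a DataFrame, as in the video
        print(df["final_price"][0])      # peek at a single field
        df.loc[0, "final_price"] = 8.88  # ...and overwrite it
        print(df.head())
    else:
        break  # we only wanted a quick look at the first chunk
```

Each chunk yielded by the loop already behaves like a regular DataFrame, so the pd.DataFrame(chunk) wrapper mainly makes the intent explicit.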
Back in our notebook, let's copy our read_csv command and apply it to our modified data, assigning the result to new_data. Let's also copy our timing command, as well as our print statements; in the last print statement we will change data to new_data. Okay, let's test if it worked. Shift+Enter, and boom, after seven seconds we are loading data of the same shape. What sorcery is this? I still can't believe it. Let's print the contents of new_data: new_data.head() will print the five topmost rows. Wow, the data exists!

Congratulations, now we know exactly how to handle enormous datasets and how to reduce them if necessary. In addition, we found that the key to a successful data project is, surprise surprise, within the data itself. It has nothing to do with the hardware, and it doesn't necessarily have to do with the quantity of data; it has much more to do with the quality.

Thank you so much for watching, I really hope you found this tutorial helpful. If you did, please give it a huge thumbs up and share it with the world. If you'd like to see more videos of this kind, you can always subscribe to my channel and turn on the notification bell. If you have any questions or anything to say, please leave me a comment below; I read your comments all the time. And yeah, I'll see you soon in another awesome tutorial. In the meanwhile, bye bye!
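To tie the last two steps together, here is a rough sketch of the save-and-reload round trip described in the captions above, with the same assumed file and column names; modified_data.csv is the name used in the video.

```python
import time
import pandas as pd

wanted_columns = ["final_price", "image_url", "url", "title", "categories"]

# Load only the relevant columns, then write them out as a much smaller CSV.
data = pd.read_csv("data.csv", usecols=wanted_columns)
data.to_csv("modified_data.csv", encoding="utf-8", index=False)

# Reload the slimmed-down file and make sure nothing was lost.
start = time.time()
new_data = pd.read_csv("modified_data.csv")
print("This operation took", time.time() - start, "seconds")
print("new_data shape:", new_data.shape)
print(new_data.head())  # the five topmost rows
```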
Info
Channel: Python Simplified
Views: 48,599
Id: x2DxiL8WOmc
Length: 15min 16sec (916 seconds)
Published: Mon Mar 06 2023