Python Pandas combine files | Data Cleansing using Pandas | ETL using Pandas

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
this conference will now be recorded hello friends and welcome to PSP trainings calm my name is rich and today I am going to teach you how do you write file combiner utility using Python panda library and G log so let me explain you what is the business requirement I receive multiple data file from a business user and in order to do any kind of analysis the part of mine is that I need to massage transform and clean data and combine multiple file into a single file so that we can do analysis however the number of multi number of source file come from various sources and we do not know in advance so you want to write a dynamic method through which we can read multiple file and combine Emma create a one file so in my business we have been using multiple your system he wanted to write a method through which we can be data for multiple ERP system my account controller extract so data from multiple source file he massage data transform data by using various technique which I taught you in a previous video and finally we want to combine this the number of file come from various sources is not known it may be 10 15 20 subject to the number of transition take place across different entities however because we do not know in advance how many files we have will arrive to my system therefore we write Diana make file combiner utility we combine these file is write it back into single file and eventually we apply various analytical algorithm and data visualizing thick needs to visualize the data this is the requirement ok let me show you sample data file so this is my sample data file in all the file which my business you can extract from er be put into one common folder I capped it over here but they use a shared location let me open one of the file and show you how it look like so this is my Sam or data file which is typically a sales order processing data with a separator or delimiter the pipe so each data file contain data from one legal entities or once Tory entities if you look at both of mine together you find it quickly differentiate this while one contained data for us a file to contains data for Australia but when I do analysis across my multiple entities I need to combine then I'm going to write a Python code for the same let me give you a quick brief how the code look like and then I cried this for practically and I show you how step pastor at the code will execute will do a dry run so we use a object over here and apply library called pond off library and geo of library genome library we would be using over here to traverse through or search fire then I declare a variable called path all the files have reading from this path and g.low function I use to search all CSV file in with 10 15 20 the CSV file would be storing mine are available for all files ok so if the tan file my array variable would contain 10 records or an index value 0 1 2 3 and so on and then finally once you read all the files you would browse through all the files and combine them so I create a dynamic array because I do not know how many files are there in advance like the dynamic array and I'll browse to all the files from an array all Fine's 0 2 and number of files it's a user loop over here there's another variable use file under scope because finally the reserve keyword therefore I use file underscore so it is same as like if I run this in terms of constant number of time and I know in advance there are 4 friends so I write like is 0 to 3 as you do not know how many times this loop work therefore I use all files and we keep reading this file by using the read CSV we are reading first file I do not want index header over here therefore B therefore we use header zero and index column none we have a separator s pipe so I used by obtained a meter you look at your radio file it's the 5 delimiter you differentiate one ticket to other by using five you read all the files one by one and I use array function called a pan so I will keep appending all fine name so the three file will keep appending all the files and finally we can't get all this file by using concrete function from panda and you can concatenate them by using horizontal or vertical I want to concatenate them by vertical therefore u X is zero also I do not want index y use index index ignore in Mexico so true finally all the output which you received from different file I want to write it try it back to one file name is all files or all data let's get started so let me check I got two source file I would be using panop I thought I'm ready I would be using spider ID it is fairly easy to use and you are free to use any ID you would like to use I would recommend to use Python spider ID okay so step one you like to bring or inherit your package found a package or kilo packing so step one we import G log which search the file and import panda and I create an object for this PD then I am going to create a variable through which I need all the files so this is a path where my all file exists this is a string variable which contain all the files the next I like to read on Sun all the files so there the genome function which help you to search what file you want to search where so the argument is path under this path I like to search all CSV files all right why calculator or CSV it means you create all the fines and the output of all this file is cap into of you know into a folder and name it as a variable called all the files right so it is showing all files finally I like to create a an array a dynamic array because we do not know in advance how many files are there I create a dynamic array with unknown indexed and new it may be to 10 15 10 20 and number of any then eyebrow to all the file one by one so I use file variable because finally reserved given therefore I use file under spool in all files right that's a beauty of using ID if you skip something it automatically recommend you the options and help you in correcting the syntax unless really all by one by one so in order to read it this file I use panda object dot read CSV these are all CSV file I read CSV each seriously takes multiple argument in our games we want to specify the file you want to read I want to read file variable file and the score and only passing to argument over here or next column I do not want index column therefore I write none and I also don't want Hatter I put header over here well delimiter is or separator this pie therefore you explicitly find pipe separator right so you read all the objects one by one then I would append these into an array so I use a patent function and it takes an object as an argument I append them one by one now finally you have a list of all object now I am going to write this into my CSV file I append concatenate them and I tend to CSV file so I clean it I'm very very vocal frame it's like a data frame object ponder of the PD contact when you can't get the first column it needs an object to which author you want to write I want to write in Harry object because you can add vertical or horizontal this you can defend you access argument I use Excel zero and I also want to ignore index ID number one index I am going to show it to you how it look like it out index the boolean value that's it you read in a frame and finally I want to write it back to one text file so we can use PD dot to CSV but what do you want to write I would write frames I write strain to CSV specify the path where you want to write the file plus a file name I want to put this as all files dot CSD all file combine and write it back into one one file it can be two 10 20 50 n number of file and I don't need to define it in advance so let's save this called file it combiner and on it and you would have value of all the variable one by one if there is no stiffness ever there is no syntax error let's read all the variable values first video values path which is fairly straightforward is a string it show the folder where all file exists second maneuver is all file it would have list of all files used in a given folder the list of all the friends the first file is Australian sales order processing data second file is us is sales order processing data right so you have got Australian USA to fire sales order processing data then you have an animate variable called chiara it would have a combined list of all the files or it's a data frame per instance so first data frame is for Australia columns and second data frame is for USA right you've got to frame and eventually you want to combine them so you would abandon and then we combine them with D F object so you write concatenate them to frame same would have intermediate value for Australian USA it's an integrated value for Australia and USA so you combine them into one object last step you write update the data into one sixty five right so this is one data all files which is integrative and tell you both of them we can test this once again what I'm going to do now once again I'm going to execute this you can move this file and copy couple of CSV file and see what happened I copy one or two more files probably you don't have a meaningful output but I just wanted to show you how this code works so pivotal instance you have done two scenes in file now you have three CSV file I just see in the country name we have data for Chile and USA and this popular the replaces country from Australia to New Zealand so I use find option and they place it Australia I use one two zero three New Zealand now I got in a similar fashion you can add four five six seven and many five as you want I save this and close it I've got three files now the internet the output which I supposed to get in it should be sum of all the four files so this is my New Zealand data NZ right let's run this code you can run this code from here or you can run this for directly from your folder structure from your filesystem as you see these files are store under C Drive Python live I did not set the reference path no friends therefore they still exist under this folder so if you just double-click this all these files dot combine and thumbs 41 by it executing all file combine and turns out the one single file right this is a all fine it's a combination of all CSV file but then for my code is dynamic and it has no difference how many source file you have in your system so now we have eight of Australia data for news- for USA and so on so that's it is a small tiny code but to create it is highly useful code with your masada your data and you're working in a file system thank you for joining today's session stay tuned to our YouTube channel in a subsequent session I am going to show you how do you write or from the same task but parameterised and creating user interface you can log in to assign PHP training and behavior solutions.com thank you
Info
Channel: Amit Sharma
Views: 4,502
Rating: 5 out of 5
Keywords: bisptrainings, bispsolutions.com, Hyperion Training, Oracle EPM, Python, Python Pandas, Pandas Data Cleansing
Id: 6VLel54058A
Channel Id: undefined
Length: 14min 26sec (866 seconds)
Published: Sun Dec 09 2018
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.