ETL with Python

Captions
Thanks, everybody, for attending tonight; I really appreciate it. My name is Doug Sartori, and I am a panelist on Mean, Median and Moose, a podcast focused on working with public data in Canada. If you're interested in learning more about it, you can check it out at meanmedianmoose.ca. I'm really appreciative of Hackforge for facilitating and organizing this meetup and giving us an opportunity to chat together. Today's talk is a brief slideshow and a demo.

So, we're talking today about ETL using Python. The talk is pretty focused on the Python programming language, but the other piece of it is the process of ETL. That acronym stands for extract, transform, and load, and it's one of the most common operations in working with data: the extraction of data from one or more data sources, then transforming it (cleaning, aggregating, standardizing, and so forth) to make that data conform to what we want to see in our target system, and then loading that data into the target system. This is work that I do all the time; it has been a mainstay of my professional work for a number of years. But I'm pretty new to Python, and particularly to using Python for ETL. That's something I've only been doing for a few months, but I'm really excited about it, I've learned a lot, and I'm hoping that I can share some useful stuff with you folks.

ETL is a field where the tools are really varied. It's a big field; there are probably more tools than any one professional could hope to gain meaningful experience with. The tools people use for ETL range from simple things like text editors, Excel, and other spreadsheet tools, used to manipulate data for one-time transformations (almost all of us have probably done that type of ETL work at some point), all the way to the other end of the spectrum: formal ETL tools that often use a graphical user interface for managing a complex pipeline of extracting, transforming, and loading data. Somewhere in between are the scripting languages like Perl and Python, and SQL itself.

I think it's important to consider the problem you're working on when you're deciding which tool to use for ETL; there's a place for all of these tools in the landscape. The professional tools at the higher end of the spectrum (the one I'm most familiar with would be SQL Server Integration Services, or SSIS) are really useful in enterprise environments where processes need to execute time and time again, on big volumes of data, in relatively constrained time frames. That's where a formal tool like SSIS shines. The scripting languages, I think, are more useful when you're dealing with a smaller or more informal process. I've often written Perl scripts for one-off data transformations, or to create one data set for one particular use case and then never looked at them again; those are great opportunities to use the smaller tools. That said, over the years the scripting languages have become more viable as ETL tools in themselves, and you'll find them crowding the higher end of the market. In particular, when you're dealing with data that comes from sources like APIs, the scripting languages are really valuable for that type of work, and we'll get into some of the reasons why.
So, the demonstration tonight is a simple example of an entire ETL process: we've got data from a couple of different sources that we're going to process in our Python script, and then we're going to load it into our target database.

We have two data sources. One of them is the Bank of Canada Valet API. If you've never worked with it before, this is something worth taking note of; the URL is there on the slide. The Valet API has a ton of really interesting data sets in it. If you're looking to expand your skills and want some data to work with in demo projects or in your own work, this is a great place to start, because it contains hundreds of different data sets and lots of different tools. It's really interesting, very reliable, and highly recommended. The formats the Valet API supports are XML, JSON, and CSV, so depending on your use case you may find a different format suits you; we're going to be consuming JSON. Specifically, our use case looks at the exchange rate data. A couple of years ago I did a similar talk using SSIS to get this same data; let me paste the URL for that code repository into the chat, so if you're interested you can take a look. One of the reasons I chose this data was so we'd have the opportunity to compare different solutions working on the same set of data.

You can see the URL for this particular API endpoint on the slide, as well as a sample of the data it returns. The Valet API response is formatted as a dictionary with three keys. The terms key gives you the terms of use for the API. Then there's a nice little key called seriesDetail that gives you the label for the data set; if your application uses multiple data series from the Valet API, that key gives you valuable information you can use in your work: descriptions and labels for the different data sets. And then there are the observations.

The observations key contains an array, and I want to get into some of the details here. That array is an array of dictionaries, which is a little more complex than this particular data set requires, but I assume some of the other data sets provided by the API need this level of complexity. Within each element we've got two keys. One is the d key, which identifies the date of the observation. The next key is the series label; the sample on the slide is actually from the monthly data set, although we'll be consuming the daily data set (their structure is the same). The label here is FXMUSDCAD, the monthly USD-to-CAD exchange rate. That key contains another dictionary, which in our case holds a single key, v, which provides the value for that particular observation. This data could be represented a lot more simply; again, I have to assume this structure serves more complex data sets as well.
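To make that concrete, here is an abbreviated sketch of the shape of the daily response (the field names follow the Valet API; the specific values are made up for illustration):

```json
{
  "terms": { "url": "https://www.bankofcanada.ca/terms/" },
  "seriesDetail": {
    "FXUSDCAD": { "label": "USD/CAD", "description": "US dollar to Canadian dollar daily exchange rate" }
  },
  "observations": [
    { "d": "2020-01-02", "FXUSDCAD": { "v": "1.2990" } },
    { "d": "2020-01-03", "FXUSDCAD": { "v": "1.2988" } }
  ]
}
```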
What all this means, and the reason I wanted to point it out, is that when we get into working with this data, we're going to have to deal with that level of complexity and unpack that structure into a format we can work with.

On our process flow we had two data sources, and the other one is an Excel document. What we're simulating here is the processing of expense reports. It's a pretty common use case: we have a whole bunch of expense reports sitting in Excel documents, and we need to make sure they're all in a common currency, in this case Canadian dollars. I've taken what would be an expense report and condensed it down to the simplest possible representation. Real-world data would probably have many more columns, but in our case all we have is the date of the expense and the amount in US dollars. If you'd like to work with this data and the script we've got here, and try your hand at making the scenario more complex, this is one place you could do it: add some complexity to the source document and see what you need to do to adapt to it. For the purposes of this demo, though, I tried to make it as simple as possible, so we've just got an Excel spreadsheet with two columns: date and dollar amount.

For the load part of extract, transform, and load, we need a database, so I've created one. The DDL, the data definition language that creates and describes that database, is on the slide; it's a database containing one table. Again, for the purposes of this demo I wanted to keep things as simple as possible; in the real world your target database would contain many more tables and a more complex structure. Here we have one table containing the date of the expense, the amount in US dollars, the exchange rate, and the amount in Canadian dollars. In a real-world database we would probably have a separate table of exchange rates, and we probably would not store the calculated Canadian amount in the database at all; we'd perform that calculation in the application. But for the demonstration I wanted something nice and simple, so this is our database.
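The captions don't preserve the DDL text itself, so here is a hedged reconstruction of a table like the one described; the object names and column sizes are my assumptions, not the presenter's exact script:

```sql
-- Hypothetical reconstruction of the demo's one-table target database
CREATE DATABASE ExpenseDemo;
GO
USE ExpenseDemo;
GO
CREATE TABLE Expenses (
    ExpenseDate DATE          NOT NULL,  -- date of the expense
    USD         MONEY         NOT NULL,  -- amount in US dollars
    Rate        DECIMAL(9, 4) NOT NULL,  -- exchange rate, exact size and precision
    CAD         MONEY         NOT NULL   -- calculated amount in Canadian dollars
);
```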
Along with Python, to do the transformation work we're going to be using a library called petl. petl is a lightweight Python library built for ETL work; the URLs for both the documentation and the source code are on the screen. I've been working with it for a few months, I've really enjoyed it, and I find it easy to use and very efficient. However, you've got lots of options if you're doing ETL with Python, and just to give you a few of them: there's pandas, a very widely used tool for data analysis in Python. The functionality petl provides is a subset of the functionality of pandas; pandas is not as focused on ETL as petl is, and it gives you a lot more bells and whistles, like the ability to create graphs or do analysis work directly in your application, so that's something to look at if you're interested in that type of work. Then there are a couple of other frameworks available. The point here is not to get into the details of each framework or module, but to let you know that there is a big ecosystem of ETL tools in Python, and if you're thinking about using Python for ETL work, you would probably benefit from exploring the landscape to find the exact tool that's right for your use case. For our purposes today, we're going to be using petl, and we're going to be creating a script called etl_demo.py.

Before we get into the actual coding, I want to focus on the libraries we're going to be using in this script, because I think it's worth talking about. One of the strengths of Python is the ecosystem of third-party modules available through its package management system, and even for a simple script like this, there's a whole lot of power available to us in those external modules. Besides petl, which we've already touched on, we're using the os module, which gives us tools to interact with the operating system and the file system. We're using pymssql, the Python driver for Microsoft SQL Server, which is the database engine we'll be using today. petl will work with any standard Python database library; anything that conforms to the DB-API will work, and once you've made your database connection, it's transparent to petl what you're working with. But to make that connection we do need the driver, in this case pymssql. configparser is a module for working with simple configuration files. I included it because this is a pretty common requirement in ETL work: we want to take the variables that will be frequently modified to configure the operation of our script and move them out of the code. We do that because not everyone on a data team is necessarily a programmer, and also because you may want to create an executable package from your Python script, which would make it really difficult for users to modify. The requests module is the tool we use for making HTTP requests, which is how we access APIs out on the web; this is probably one of the biggest strengths of doing this type of work in Python, how convenient and easy it is to work with APIs through requests. Then we've got a couple of modules for working with specific data types we're going to need, datetime and decimal. And finally there's the json module, which allows us to consume the JSON data from our API and turn it into native Python objects we can manipulate.
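Gathered together, the import block those modules imply might look like this sketch (the demo script's exact order and aliases may differ):

```python
import os            # interact with the operating system and file system
import json          # turn the API's JSON text into native Python objects
import configparser  # read simple external configuration files
from datetime import datetime  # parse date strings into real dates
from decimal import Decimal    # exact, non-floating-point dollar values

import requests      # make HTTP requests to web APIs
import pymssql       # DB-API driver for Microsoft SQL Server
import petl          # lightweight ETL: tables, joins, selects, I/O
```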
Just before we get into scripting, I want to talk again about those data types, and maybe a good way to do that is to show you our DDL and actually run it. So we've got a SQL Server Management Studio window open here, and here is our DDL script. We'll execute it, and it creates our database and that table. Going back to the data types we were talking about before: we have some very specific types we want to use, and SQL Server is strongly typed. It requires you to explicitly identify the exact data you're going to be storing, which is pretty common among relational database engines, and relational databases are a common target for ETL operations. That's one of the points of friction when you're using a dynamically typed scripting language like Python against a statically typed database system and language like SQL. The types in SQL Server are extremely explicit, to the point where we define not only the general type of data we're storing but also the exact size of the data we're expecting. You can see there that for the exchange rate we're not just specifying a decimal; we're specifying the exact size and precision of the decimal we're going to store. This is really important for a number of reasons, and it's really valuable that SQL Server is strongly typed like this: it allows us to eliminate whole classes of error from our system, because we know the data stored in those columns is exactly the style and representation of data we expect. But it creates that friction point with a scripting language that is less precise about data types. Dynamically typed languages don't do type checking at compile time; they check types at runtime, and they're a lot more flexible about what kind of data they accept. We have to keep that in mind, and it's one of the reasons we need those additional modules for the specific data types we're working with.

So we've got a nice blank canvas. Before the actual coding, here is the .ini file which contains our configuration variables. These are pretty typical things you'd want to move to an external configuration file: the start date we're going to supply to the API, the URL of the API we're connecting to, the name of the database server we're connecting to, and the name of the database. Those are all things that could easily change during the life of this system in production, so we move them out into a config file.

All right, let's start by actually reading the data from that configuration file. To do that, we first instantiate our configuration object: the configparser module has a ConfigParser class that lets us create a new instance, which we'll call config. Now we read our configuration file. I went back and forth on whether to include exception handling in this script, because it makes the overall script a little harder to read, but I thought it was valuable. When you're doing this kind of work, it's important to make sure you're not inadvertently loading bad data into your system, and one of the ways we do that is by carefully monitoring all of the I/O operations, the input and output, that are part of our ETL process. In Python, the tool we use for that is exception handling. So although what we really want to do here is one line of code, reading the configuration file we specified, we're putting it inside a try block to make sure it works. That structure makes sure the code executes the way we're expecting; if any exception is raised (an exception is how Python reports an error), the code under the except block will execute instead.
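A minimal sketch of that step, assuming a file name, section name, and key names that stand in for the demo's actual ones:

```python
import configparser
import sys

# Assumed etl_demo.ini layout:
# [CONFIG]
# startdate = 2020-01-01
# url = https://www.bankofcanada.ca/valet/observations/FXUSDCAD/json
# server = localhost
# database = ExpenseDemo

config = configparser.ConfigParser()
try:
    # read_file() raises if the file is missing, unlike read()
    with open('etl_demo.ini') as f:
        config.read_file(f)
except Exception as e:
    print('Could not read configuration file:', e)
    sys.exit(1)

startdate = config['CONFIG']['startdate']
url = config['CONFIG']['url']
destserver = config['CONFIG']['server']
destdatabase = config['CONFIG']['database']
print(url)  # quick sanity check that the file was read
```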
In this case our exception handler is pretty simple: we tell the user, by printing text on the screen, that we couldn't read the file, we dump whatever error message we get back from Python, and we quit. Now that we've got our configuration data, we can read it, populating a set of variables with the values from our configuration file. Just for fun, let's try this out and make sure it's working. The script so far is working well; there's the URL. That tells me the configuration file was read, and we can rest assured all of those configuration variables are going to be read into the system and populated the way we want. So that's good.

Next, we're going to get the data from the Bank of Canada API. Again we're using a try block, this time to make sure the request to the internet succeeds. We create a request using the get method from the requests module, which takes a URL and parameters; remember, the URL is there and the start date is there, so we're combining those into one request and sending it. What we get back is the response object, which gives us a whole bunch of ways to operate on the response that comes back from the HTTP server. We could do a lot of different things with it, but in our case we're really only interested in two of its properties. The first is the status code. Just a crash course in HTTP: when you make an HTTP request like this, the server responds with a status code. 200 is the code we want; it means OK. You may be familiar with some of the common error statuses, like 404, which is what happens when the resource isn't found, and so forth. So we care about the status code we get back, and then we care about the body of the response. Why don't we run this code with another print statement and take a look at the data coming to us live from the API? There's the output, and it should look pretty familiar if you were watching earlier and saw the slide: this is the data we expect to see, in JSON format, and everything looks like it's working so far.

Now I need to do something that won't make a whole lot of sense until later: there are a couple of empty arrays we need to create. That goes back to my earlier comments about the rather complex structure the Bank of Canada API uses to store the exchange rate data. We're going to need to find the values in that data structure and populate a couple of arrays, so we have two arrays for that purpose: the Bank of Canada dates and the Bank of Canada rates. And I mentioned a moment ago the status code that comes back in the response object; we have an if statement here, and what we're doing is making sure that the status code is 200.
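A sketch of the request step, continuing from the configuration sketch above and using the same assumed variable names:

```python
import requests
import sys

# Ask the Valet API for observations from the configured start date on;
# start_date is a documented Valet query parameter.
try:
    response = requests.get(url, params={'start_date': startdate})
except Exception as e:
    print('Could not reach the Bank of Canada API:', e)
    sys.exit(1)

print(response.status_code)  # 200 means OK
print(response.text)         # the raw JSON body
```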
A 200 means OK, and it means our request was successful and the server gave us back a response. What we do with that is use the json module to take the text coming to us in the response object and turn it into native objects, which we store in the variable boc_raw. One thing I want to mention: the requests module actually gives you tools to consume JSON directly, so you could do it that way if you like. I chose to consume the raw text and parse it with the json module, but there are a few different ways to do it.

For those of you who aren't familiar with Python: Python does not have the typical programming construct where semicolons end a statement, and where the language doesn't care how you structure your code visually because it looks for those markers to separate statements. Python is a little different; what matters is indentation. You may notice that the rest of the code is indented at least one more tab stop, and the reason is that everything after this point in the script is contingent on that status code being OK. If the status code from our API request isn't OK, we can't get the exchange rate data, so we can't connect it to anything else, and we might as well not bother. I wanted to point this out because control flow in Python is a little more subtle than you might expect if you're not familiar with the language; that's why everything else will be indented.

I mentioned a moment ago that we get the observation data in a really odd format and need to do something to make it usable for us. What that looks like is this: we take that data and populate those two arrays, which become what petl refers to as column arrays. There are a number of ways you can load data into petl, and one of them is to have arrays of columnar data, where each column in a table is represented by a separate array; then we mash them together to make our petl table.

So here's another control flow construct: we create a loop over every row in the observations array. Going back to the slides, remember that the observations key in our API response contains an array of observations. We loop through every row in that array with the for loop we're declaring here, and whatever we do in the lines indented under that for statement is repeated for every row in our data.

First we work with the dates. We take our date array and append to it, which means adding a new element at the end, and what we want to add is our date. Here's a really important point, though: later we're going to compare the dates from our Bank of Canada source data with dates from another data source, and in order to compare them effectively, they all need to be the same data type. That's not as obvious in Python as it is in a language like SQL, where we have to strictly define the types of all the data we're working with.
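Here is a sketch of the whole parsing loop described here and in what follows, assuming the daily series label FXUSDCAD:

```python
import json
from datetime import datetime
from decimal import Decimal

# Column arrays that will become the date and rate columns of a table
boc_dates = []
boc_rates = []

if response.status_code == 200:
    # Parse the raw JSON text into native Python objects
    boc_raw = json.loads(response.text)

    # Each element of the observations array looks like:
    #   {"d": "2020-01-02", "FXUSDCAD": {"v": "1.2990"}}
    for row in boc_raw['observations']:
        # Force the date string into a real datetime for later comparisons
        boc_dates.append(datetime.strptime(row['d'], '%Y-%m-%d'))
        # Store the rate as an exact Decimal, never a float
        boc_rates.append(Decimal(row['FXUSDCAD']['v']))

    print(boc_rates)
```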
So what we're doing here is forcing that data to be the particular type we need it to be. If you skip this step, or if you're working with petl and find you're having problems making data connect, this is one of the first things to check: are all the columns of data I'm working with the same type? We're using the datetime module we talked about earlier, and a method called strptime, which takes a string and turns it into a date that Python understands as a date. The dates that come back from the Bank of Canada API are structured year-month-day, so we identify that as the format we expect, and we convert it. We look in the particular row of the observations array we're working with, find the d key (the key that identifies the date), convert it into a datetime, and hang it on the end of our Bank of Canada dates array. All of that happens in one line of code.

The next thing we need to do is the same work to get our Bank of Canada rates into a column array. We have that boc_rates array, and we append a row to it, looking in the row of data coming back from the observations. In this case we need to go two levels deep, because, if you remember when we looked at the data from the API, the row has the key naming the specific currency series and type of observation we're looking at, and that key itself points at a dictionary containing, in our case, a single key, v, which gives us back the value. And just as we did with the dates, we need to make sure we explicitly store this data in a decimal type. We don't want it to be a floating point, for the reasons I stated earlier: we don't want an approximation of the number. People get really cranky when you use an approximation of the dollar amount that you owe them or that they owe you; you need the precise value. So we use the decimal module to store that data and preserve the exact value, which might not happen with a floating point.

At this point we should have two arrays populated with the values we got from the API: we have extracted the information we wanted from the Bank of Canada API, and we've done the initial transformation to turn it into data we can work with. Let's take a look at that output. This is Python printing a representation of the array, and what we're seeing is a whole bunch of Decimal values: those are the exchange rates. They're not a whole lot of use without the accompanying dates, but that happens at the next stage, where we combine the two arrays into one table. We've done a lot of stuff so far, and now, finally, after 49 or so lines of code, we're actually going to start using petl.
The way we're going to do that is by creating a petl table from those two columns, giving the columns the names we need. petl stores data internally in tables, which you can do a lot of things with; there are a lot of methods available for working with petl tables, so we want to get our data into that table representation as quickly as possible. Let's do that, and then take a look at what our exchange rates look like. We're just getting the first few rows: when you print a petl table, it gives you a sample of the data rather than the whole data set, which is fine for our purposes. So you can see that with all the work that came before, we've got our data into a table representation inside our script.

Now we can consume the next piece of data, which is the expense document. If you remember, the expense document is stored in Excel; it's a two-column spreadsheet with dates and amounts in US dollars. Just as before, we use some exception handling to make sure the document is there and everything we expect is where we want it to be. petl gives us some functionality, inherited from the Python module for working with Excel documents, for reading and writing Excel files. Here we're using petl.io.xlsx to read a specific Excel document and pull the data from a specific sheet. (And yes, those were my GitHub expenses for 2020.) What's nice about this is, first of all, how clean and easy it is to read data from Excel with petl. The other nice thing is that the module petl consumes deals specifically with these documents, so although we're not going to get into it here, because our target is a database, if you are targeting an Excel document you can do a lot of really cool stuff: you can put formulas into certain cells, you can explicitly create named ranges; there's a lot of capability there, and you can do it all within your Python script, which is pretty cool. In our case, all we need to do is read that data, and we'll take a look at it just to make sure it's what we expect. There we go: there's the petl table (again, we loaded the data into a petl table) representing our Excel spreadsheet, and you can see the dates and the amounts in US dollars. So we've got everything we need, and now we have two tables we can join together to get the data we want, because, if you remember, the target database contains some of the values from the Excel document and some of the values from our Bank of Canada data. We need to combine them.
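A sketch of both loading steps; the spreadsheet file name, sheet name, and the expense column names (date, usd) are stand-ins for the demo's actual ones:

```python
import petl
import sys

# Mash the two column arrays together into a petl table
exchange_rates = petl.fromcolumns([boc_dates, boc_rates],
                                  header=['date', 'rate'])
print(exchange_rates)  # printing shows only the first few rows

# Read the expense report from a specific sheet of an Excel document
try:
    expenses = petl.io.xlsx.fromxlsx('expenses.xlsx', sheet='expenses')
except Exception as e:
    print('Could not read the expense spreadsheet:', e)
    sys.exit(1)

print(expenses)
```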
I want you to pay attention to a detail, though: there are only a couple of expenses a month here, so this set of dates is sparsely populated; it is by no means all the days of the year. An expense report might contain multiple values for one day, or no values for a day at all, and it's really important to remember that when we're working with it, because it constrains the kinds of operations we can do. On the other side we've got a set of daily Bank of Canada observations, so there's an exchange rate for every day, except that the Bank of Canada does not report exchange rates on weekends: there are no Saturday or Sunday observations in our data. So the exchange rate data set is also sparse; it's also missing dates. And what makes that even more problematic for us is that we can, and do, register expenses on weekends. We've got a bit of a data problem to fix here, because we have some dates that exist in only one table or the other.

To deal with that, we need to think about it when we decide how to join our tables. Those of you experienced with database operations are probably familiar with left joins, right joins, and inner joins, and in most cases those are the right things to do. In the normal case, if we had all the data we wanted, we would take the expenses table and join the exchange rates to it, getting exchange rates only for the dates that exist in our expenses table. We'd do that as an inner join, dealing only with the intersection of the two data sets. But in this case that's not going to work for us, because of the problem I identified. So what we actually have to do (this is the most interesting piece, folks; if you're thinking about riding the roller coaster, we're ticking up to the top of the hill, and this is about the most exciting it's going to get tonight) is an outer join, a full outer join. What that means is that we get a result that has every value from our Bank of Canada data and every value from our expenses data.

Let's take a look at what that looks like. You may note that I'm reusing table names here: we're not creating a whole bunch of separate petl tables. Once we know we're not going to need the old data anymore, I just reuse the table name, so we replace the expenses table with the result of the outer join. To give you a little more detail on what's happening in this line of code: we identify the two petl tables we want to join, the exchange rates table and the expenses table we created, and the key we want to join them on. If you dig into the petl documentation, you'll see there are a few different ways to join tables; we're not going to get into those details right now. We're joining the tables on one key, and that's the date.

So let's see what we've got so far. Again, we see just a subset of the table we're printing, the first few days of exchange rate data. You'll note that January 1st is not in the data, because it's a holiday or a weekend, so there's no observation on that day, but January 2, 3, 6, 7, and 8 are all there, and they all have exchange rates. We can't see all of the values in this sample, but at any point where there is both a rate and a US dollar amount, we'll get both values, because we're joining the two tables together; we'll see the value from the expenses table and the value from the exchange rates table side by side in the joined row. In the case where we don't have a currency observation but we do have an expense report, we'll see None in the rate column and a value in the USD column.
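The join itself is one line; a sketch, using petl's outerjoin with date as the single assumed key:

```python
# Full outer join on date: every row from both tables survives, with
# None filling the columns that have no match on that date. The name
# 'expenses' is reused since the pre-join table is no longer needed.
expenses = petl.outerjoin(exchange_rates, expenses, key='date')
print(expenses)
```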
So how do we fix the problem of not having an exchange rate for every day? petl gives us a really handy tool for this, called filldown. Just like in an Excel document, where you might grab a value and fill it down into blank cells, that's exactly what this utility function does. We look in the expenses table for blank rates, and we know, because we know our data, that a blank rate means it's a holiday, a Saturday, or a Sunday, with no exchange rate. It's perfectly reasonable for us to use the last good exchange rate on those days, and the last good exchange rate will be the one from the day before, the Friday or the day before the holiday, which should be the value just above the blank value in our table. So the filldown function is perfect for our purposes, and what it gives us is a situation where every row that has an expense also has an exchange rate reported.

We're almost home, because now we have a big table full of a lot of values we don't want, the days when we had no expense report, mixed together with a bunch of values we do want, the days where we have an expense report and now also have an exchange rate. The next thing we need to do is get rid of the data we don't want, and we do that by again replacing the expenses petl table, this time with the result of a select, a petl select. Those of you familiar with databases will recognize the select statement, and the purpose here is pretty similar: we identify a table and then we identify our condition. In petl, the way we supply the condition is as a lambda function. A lambda function is the fancy modern word for what we used to call anonymous functions: a piece of code that you can pass around in your script and apply in many ways. In this case we create a lambda function that gets passed a record from the petl table; the select evaluates every record and passes it to the function. We check the USD column in that record and make sure it has a value, because those are the only records we care about. A record with no value for USD means there was nothing in our expenses table on that date, and we can get rid of it. So we select only the subset of rows that have a value in the USD column.

Now is a good time to take a look at what we've got so far. Well, look at that: we have only the dates we care about, the dates from our expense report, with no gaps; we have an exchange rate reported for every single day, and we have a US dollar amount. We're almost there, folks. I know this has been a lot, but we've only got a few more steps before we're done.
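A sketch of those two cleanup steps, with the column names from the earlier sketches:

```python
# Carry the last known rate down into blank (None) rate cells, so
# weekend and holiday expenses pick up the prior business day's rate.
expenses = petl.filldown(expenses, 'rate')

# Keep only rows that came from the expense report, i.e. rows where
# the usd column has a value; petl calls the lambda once per record.
expenses = petl.select(expenses, lambda rec: rec.usd is not None)
print(expenses)
```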
The next thing we need to do is add our Canadian dollar column. Remember, our target database has four columns: a column for the date (we've got that), a column for the exchange rate (we've got that), a column for the amount in US dollars (got it), and a column for the amount in Canadian dollars, which we don't have yet. So we need to add it, and we're going to use another petl function, which allows us to add a field to the table. In this case we're not getting the data from anywhere; we're synthesizing it from information we already have. We're adding a calculated field, and this is the tool you'd use in petl to do that. We use the addfield function, we identify the target table (the expenses table), we name the new column, and then we supply a lambda function that gives us the value we want: we take the rate and multiply it by the amount in US dollars. Remember that we got the US dollar amount from an Excel spreadsheet; we don't want it to turn into a floating point number, and we don't want it represented as a string, so we explicitly tell Python to work with it as a Decimal. The rate value is already a Decimal, so we end up with values we can work with. In SQL Server, the CAD and USD columns are going to be the money data type, but we're using the Decimal type all the way through in petl.

We should now have a table that we're ready to load into our target database. Let's take a look, and there we are: the amount in Canadian dollars is there for us, and we have a table that looks pretty much like what needs to go into our database. I hope you'll agree that this was actually pretty painless and simple compared to working with some other, heavier-weight ETL tools to do all these operations.

The last thing we need to do is make a connection to our database and populate the expenses table with the data. Once again, in a try block, we encapsulate our request to the pymssql module, Python's Microsoft SQL Server driver, to connect to a server. You may remember these values from the configuration file; they got populated into variables, and we're finally using them: the destination SQL Server and the destination database name. Assuming all goes well and we don't end up with an exception because we can't connect to that database, we're ready to populate our database table, and we do that with petl, again making sure with a try block that nothing weird happens. We use petl's I/O functionality to load data into a database, and it's pretty straightforward: we use the todb function, we identify the table, we identify the database connection we're using, and we give the target table name, the table we set up called Expenses. Let's run the script one last time, and what we should get is no errors: "press any key to continue." We switch over to our database, take a look at the contents of our expenses table, and there we are. We have successfully taken data from multiple sources, transformed it according to the requirements of our particular application, and loaded it into a target database. That is ETL with Python.
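A sketch of those final steps, reusing the assumed names from the earlier sketches (destserver, destdatabase, the expenses table, and the hypothetical Expenses table from the DDL sketch); a real pymssql connection will usually also need credentials, omitted here:

```python
from decimal import Decimal

import petl
import pymssql
import sys

# Add the calculated CAD column: rate times the US dollar amount,
# computed in exact Decimal arithmetic rather than floating point.
expenses = petl.addfield(expenses, 'cad',
                         lambda rec: Decimal(rec.usd) * rec.rate)
print(expenses)

# Connect to the destination SQL Server database
try:
    dbconnection = pymssql.connect(server=destserver, database=destdatabase)
except Exception as e:
    print('Could not connect to the database:', e)
    sys.exit(1)

# Load the petl table into the target Expenses table in one call
try:
    petl.todb(expenses, dbconnection, 'Expenses')
except Exception as e:
    print('Could not write to the database:', e)
    sys.exit(1)
```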
Info
Channel: Mean, Median and Moose
Views: 6,083
Rating: 4.9786096 out of 5
Id: InLgSUw_ZOE
Length: 57min 19sec (3439 seconds)
Published: Thu Apr 22 2021