Data Engineering Principles - Build frameworks not pipelines - Gatis Seja

Captions
Gatis is here to give us a talk; he's with Deep Data Intelligence. I don't know what that is, he'll tell us. Please welcome Gatis.

So, in 1431 a merchant by the name of John Beer was travelling from Hampshire to London on his cart. John had one pound of cheese and one quarter of quicklime, a quarter being a unit of volume at that time. On market day he sets up his stall and puts his goods up for sale. The first customer arrives and inquires about the cheese: "I would like one pound of cheese." John thinks, excitedly, "I might have a sale," and the customer asks how much it costs. John had bought one pound of cheese in Hampshire for fifteen pence, and he says that one pound of cheese in London is ten pence. The customer thinks it's a very good deal, cheese from Hampshire is very good, "I would love to have some for my dinner and for breakfast tomorrow." So they exchange the goods and the money, and both John and the customer are happy with the transaction.

After a while a builder comes along and inquires about some quicklime, quicklime being used in building and construction. John had bought one quarter, the unit of volume used in Hampshire, for ninety pence, and he tells the customer that one quarter in London costs fifteen pence. The customer thinks, yes, Hampshire makes some good quicklime, "I can really build my home with this," or whatever they used it for. Again they exchange the goods for money, and both John and the customer are happy with the transaction.

So my question for the audience is: why were both John and the customer happy with the transaction? Show of hands. "John's a good businessman." He is a very good businessman. "Different units." Exactly. In 1431 a pound in Hampshire was not the same as a pound in London. (You're wondering how this leads to data and databases; it's very applicable, I promise.) Same thing with a quarter: a quarter was a standard measure back then, but it wasn't the same in all the different regions around Britain. In fact the word "pound" comes from the Latin word "pondus", in case you wanted to know.

So what did John Beer have to do to sell his product in 1431? At the source, in Hampshire, he had to find out what the measurement system was in that place, what the quality of the goods was, who the vendors were (there might be multiple vendors), and what the price of those goods was in the local currency. He had to know the language and dialect in Hampshire, because certain areas had their own words and technicalities. He had to deal with the currencies, because different currencies were also a problem in 1431; there were lots of different local currencies in Britain. He had to know what kind of storage he would have before he shipped his goods off, and what the laws and religious customs of that area were. Then, for transportation: which roads was he going to use, or was he going to go by ship to London? What kind of security did he need to transport his goods from Hampshire to London? Did he have to pay for a security convoy, or travel in a group of merchants, which is what they did in 1431? What packaging was he going to use? Finally, at the destination, he had to do a lot of the same things in London: what was the measurement system, the quality, the price, the language and dialect, the currencies, were there any levies, what was the storage and how was he going to store it, and what were the laws and religious customs there? That, you'd have to agree, is a massive job just to sell a bit of cheese and a bit of quicklime.

So what's the answer to John's problem? Standardisation. The problem is that Hampshire and London don't have the same standards. John had to understand the price of the goods at his destination before even buying the goods; he had to have all of it planned out, and the transport of information back in that time was very slow. It was a risky venture for John to take the cheese and the quicklime to London. What he can't really do is scale: it's very hard for him to add another product. (Someone says this is a bit cheesy, so now I've lost my train of thought, thank you.) To add a new product he'd have to go through the whole process again and keep monitoring it. What merchants tended to do back then to increase profits was to go into quality, because assuring quality for one good was much easier than scaling out horizontally.

Standardisation. I've done the research here, and these are actual records that I found; I feel that understanding history is a good indicator for knowing what to do, because many of these problems have been solved in the past. Around 959 there was an ordinance by King Edgar to standardise money across the different areas, and he was somewhat successful. In 1215 the Magna Carta tried to standardise weights and, I think it was, lengths. The biggest change happened in 1824 with the Weights and Measures Act. However, I would like to read out the top one for you, from 1924, which is quite late: there were still in use 25 local corn weights and measures, 12 different bushels, 13 different pounds, 10 different stone and nine different tons. In 1924! This is after the First World War, which makes you wonder how on earth they conducted a war; how did the quartermaster say "I need this from you and that from you"? It must have been a mental undertaking. Why was this still the case? The record blames local customs so strong that many have survived to the present day. So the problem is not a technological one, the problem is a human one: how can you convince people to standardise the way goods are transported, and to use the same language throughout that process?

This is very similar to a data engineer's problem, the problem you face when you get data. You get it from many different sources: databases, FTP, APIs, S3 buckets, HTML if you do web scraping. It comes in many different formats: JSON, XML, CSV, or some arcane format that somebody at some vendor created, probably just to put it on their LinkedIn, and that you then spend weeks trying to decipher. And it can come in many different compression formats: zip, tar.gz, gzip, and so on. (Zip and tar.gz are awful formats in my opinion, but we can talk about that later. Please just don't make me deal with them.)
So what you really have is data from all these different sources, in all these different formats, with all these different compression types, and it's really a cross-product problem. What you really want is to get your data into the data warehouse in a two-dimensional format: on the left you have multi-dimensional data, on the right you have two dimensions, and that is where your stakeholders consume your data. And it's more complicated than that, because it's not just a relational database: you can have columnar stores, you have schema-on-read. Which ones do you use? How can I make a process simple enough and maintainable enough to have a data warehouse with a large variety of data in it?

A data warehouse has at least two things that make it work well. One, it has to have a large variety of data; the data warehouse is the modern paradigm of a library, and if a library only has one topic, not many people are going to go to that library and read its books. Two, it has to be a trusted source of information: if I'm not getting my data when I expect it to be there, then I'm not really going to use it for the systems I build later on. So what kind of approach lets me do this without having to code everything a different way for every single one of my pipelines? The only bit of Python I actually have in this talk is this; does anybody know where that's from? Yeah. It's a bit more complicated than "simple", but we'll get into it.

So let's start with what we have to do. We have our source data, we have our data infrastructure where we'll be doing our work, and we have a data warehouse, which is typically the place where people consume their data: they run SQL queries and join different datasets together. Then you have your data lake. Who here knows what a data lake is, or has used one? Show of hands, please. Who hasn't? OK, about fifty-fifty, so a quick introduction to data lakes. Imagine the files and folders on your computer where you save your data; a data lake is something like that in the cloud, one central place where anybody you allow can have access to that data. If the data sits on a single server, the server might go down and you can't request it with API calls, but you can on a data lake, and it's meant to scale out and hold a lot of data. Some typical examples are Amazon Web Services S3, DigitalOcean Spaces, and I think Google's is called Cloud Storage, among others.

OK, so let's look at the top of the diagram: we have a compute layer and a storage layer, and this is how we're going to work through it. The first process is extracting your data. Now, if you're going to remember anything from this talk, please remember this bit: when you extract your data, save it in its raw form somewhere in your data lake (it can be in your folder file system, but save it in its raw form). That means if you're scraping a website, you should save that website in its raw HTML form so you can get it back later. Why is that important? Typically what people do is extract the data, or scrape it on the fly from the live website, and put it straight into the database. But say that process is broken and I scraped something that isn't actually correct: if I don't have the saved extracted file, I can never go backwards in time; that data is lost and I'll never have it again. So you should be saving your data as soon as you get it.
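To make that concrete, here is a minimal sketch of an extract step that persists the untouched payload before anything else happens. The talk doesn't show this code; boto3 and requests, the bucket name, and the key layout are all assumptions made purely for illustration.

```python
# A sketch only: assumes boto3 + requests and an invented bucket/key scheme.
import datetime

import boto3      # any object-store client would do; S3 is just the example from the talk
import requests


s3 = boto3.client("s3")
RAW_BUCKET = "my-data-lake-raw"   # hypothetical bucket name


def extract(url: str, source_name: str) -> str:
    """Fetch a source and save the untouched payload to the data lake.

    Returns the object key so later stages can point back at the raw data.
    """
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Key the object by source and extraction time so raw data is never overwritten.
    ts = datetime.datetime.utcnow().strftime("%Y/%m/%d/%H%M%S")
    key = f"raw/{source_name}/{ts}.html"

    # Store exactly what was received: no parsing, no cleaning, no transforms.
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=response.content)
    return key
```

The only design decision that matters here is that the object written to the lake is byte-for-byte what the source returned, so any later bug can be replayed against it.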
The next process is the transform and load process. It's very simple as it goes: the cogs here are compute and the lower level is storage. You take your data, transform it, and load it on the fly straight into your data warehouse to be consumed. That is a very good pipeline: it has very few nodes and the data goes from left to right. However, sometimes you might need to transform your data and save it as files again, so you have another transform phase, and only after that do you load it into your warehouse. This is not as good as the first method. Say I have a problem in my data: I'm looking at the main table, I'm querying it, and I see rows that are really weird. Why are they weird? You have to investigate back through the previous nodes to where the raw original format is, and if there's a transform layer in between, that's not the original source of the data; I might have to go all the way back to the extracted layer. With fewer layers it's much easier to find where the original problem was. Does that make sense so far? Just a nod is fine. OK.

So that's a good framework, but you're still not sure whether the data you're getting is correct. Something in your process could be wrong, and the data your stakeholder is getting could be wrong. It's very important that your stakeholder gets the right data: if I'm buying a product at the supermarket, I shouldn't have to look inside the product to see what's in there; I should have the confidence of knowing that the data I'm querying is fine. The end user shouldn't have to validate that data. So we add a validation stage to the extract phase. Say I'm scraping a website and I expect it to have certain fields, or for a certain table to exist in that HTML. If it doesn't, then save that data into an "extraction failed" area, so you can have a look at what the problem was and change your functions or your methodology to reflect how things have changed.

The next thing you should do is validate your data before it's given to your stakeholder, or before you consume it. Put your data into a staging table, then have a validation function or a validation SQL query that you run on that data and that asks: is this data like the data I expected? There are two real areas I've seen where you can validate. You can validate the schema, is the schema correct, which is your schema-on-read; and you can validate the business logic of the data you get. For example, I should be expecting these three values in my data, so make sure they are in there: if not, it fails; if yes, it carries on. (There's a small illustrative sketch of such a check below.)

And then you need monitoring on all these different processes, to tell you where it's breaking and where it's working, and you should always be informed of that. In fact, and this goes more into data as a product, the final consumer should be able to see the monitoring that happens on the data they consume; they should be able to say, this has been reliable for a year, two years, nothing has broken. Your managers should be able to see that your processes are working well, and you should be able to see that your processes are working well, and where they fail.
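As a rough illustration of the staging-table validation just described, the sketch below checks both the schema and a couple of business rules before a batch is allowed through to the main table. The column names, the rules, and the pandas-based approach are invented for the example; the talk only describes the idea.

```python
# A sketch only: column names, rules and the pandas approach are invented.
import pandas as pd

EXPECTED_COLUMNS = {"order_id": "int64", "price_pence": "int64", "region": "object"}
ALLOWED_REGIONS = {"Hampshire", "London"}   # example business rule


class ValidationError(Exception):
    """Raised when a staged batch should not be promoted to the main table."""


def validate_batch(staged: pd.DataFrame) -> None:
    # 1. Schema validation: are the columns and dtypes what we expected?
    missing = set(EXPECTED_COLUMNS) - set(staged.columns)
    if missing:
        raise ValidationError(f"missing columns: {sorted(missing)}")
    for col, dtype in EXPECTED_COLUMNS.items():
        if str(staged[col].dtype) != dtype:
            raise ValidationError(f"{col} is {staged[col].dtype}, expected {dtype}")

    # 2. Business-logic validation: values we know must hold for this feed.
    bad_regions = set(staged["region"]) - ALLOWED_REGIONS
    if bad_regions:
        raise ValidationError(f"unexpected regions: {sorted(bad_regions)}")
    if (staged["price_pence"] <= 0).any():
        raise ValidationError("non-positive prices found")
```

If validate_batch raises, the batch can be parked in a failed area, mirroring the extraction-failed area above, and surfaced by monitoring instead of silently reaching consumers.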
If they do fail, it's not always your fault: often the data is dirty, and you have to think of a good process to clean it. When it fails, you go back to it, you clean it, and you work out what the problems are. As a result, two situations can happen as time goes on. What you'd like is for your data to keep trickling into your database, but say we have a validation failure at some point. Either we stop: we get our data, we stop the flow, and no more data goes in until it's fixed; or some number of processes fail and we still get snippets of data in, with gaps. Which one do we use, the middle one or the bottom one? The middle. Anybody want to elaborate why the middle, or why the bottom? Yes, you could find the gaps with a validation SQL query. And yes, the end consumers won't know where the holes are; they'll assume the whole dataset is fine when in actual fact there are gaps. Either way, the stakeholder should be aware of which approach you're using when you publish that data: if they're happy writing their SQL queries knowing there could be gaps, that's fine, but they should be aware of it. In general, though, the middle one is what you should be aiming for: it fails, you go back, you check what the problem is, you fix the code and improve it, and then your data continues going forward; or you go down to each little gap and you fill it in. (A rough sketch of that kind of gap backfill follows below.)
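One way to make "fix the code, then fill the gap" routine is to parameterise the transform by a date window and make it idempotent per window; this is an assumption layered on top of what the talk says, and the function and path names below are made up.

```python
# A sketch only: function and path names are illustrative, not from the talk.
import datetime as dt
from typing import Iterable


def daily_windows(start: dt.date, end: dt.date) -> Iterable[dt.date]:
    day = start
    while day <= end:
        yield day
        day += dt.timedelta(days=1)


def transform_and_load(day: dt.date) -> None:
    """Idempotent per-day transform: re-running the same day overwrites that
    day's partition instead of duplicating rows."""
    raw_prefix = f"raw/orders/{day:%Y/%m/%d}/"   # hypothetical data-lake layout
    # ... read raw objects under raw_prefix, transform them, then load with a
    # delete-and-insert (or MERGE) scoped to this day's partition ...


def backfill(start: dt.date, end: dt.date) -> None:
    """Replay a failed window after the code has been fixed, closing the gap."""
    for day in daily_windows(start, end):
        transform_and_load(day)
```

Because each window can be re-run safely, "stop, fix the code, continue" and "go back and fill each gap" become the same operation applied to different date ranges.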
So let's go through some principles from what I've described so far. The most important thing is understanding your data consumer. What do they want? Do they want streaming? How quickly do they need their information and their insights: in seconds, in hours, in days? That will affect the technology you use. If you look at the diagram, I haven't really put any technology on it, just cogs and an S3 bucket; you can use whatever you want for those, because it has been built up from principles. Understanding your data consumer and your data will determine what you use in those ETL processes.

Keep your data in its raw form; I think we understand why, we don't want to lose any of our data. Don't delete or move your raw data once it's landed; that should be it, and you should spend very little time going back to fix problems in your raw data. Your transformations should happen over all of time, not just at the present moment: you should be able to go back through your data and say, from five years ago up to now, my transformation function takes the data from my extracted layer into my main table perfectly well. The data engineer's real job is to maintain those transformations, and you do not want to have to maintain much else.

Separate out your extract and your transform-and-load processes: you don't want one to fail while the other continues, you want them to be separate processes that run by themselves. Minimise the number of data and compute nodes: if I have too many nodes in my system, I have more chances for bugs to creep in and more errors to happen, so by having only a couple I have less of a chance of anything actually going wrong. Make your ETL acyclic, which means data should only flow in one direction. I've seen companies with databases that refer to other databases, which cycle around and back to themselves: what is the true source of your data then? It could effectively be lost, and you'd have no idea. Even within a database, how you structure it matters: if you create a new table, it should be acyclic in nature, it should only go in one direction. Validate your data before it goes to its consumers. Joining should happen at the database level; in my opinion, that's what SQL is made for. Your pipeline should get the data in there, and if you want to join it with something, do it there (there's a small sketch of that below). And then monitor your data; you need to know how well it's working.
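As a small illustration of the "join at the database level" principle, the sketch below pushes the join into the warehouse as SQL rather than pulling both tables into Python and merging them there. sqlite3 stands in for whatever warehouse connection you actually have, and the table and column names are invented.

```python
# A sketch only: sqlite3 stands in for the real warehouse; table names are invented.
import sqlite3

QUERY = """
SELECT o.order_id,
       o.price_pence,
       c.customer_name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
WHERE  o.order_date >= :since
"""


def orders_with_customers(conn: sqlite3.Connection, since: str) -> list:
    # The warehouse executes the join; the pipeline never pulls both tables
    # out just to merge them in Python.
    return conn.execute(QUERY, {"since": since}).fetchall()
```

The pipeline's job stays the same either way: land the data in the warehouse, and let the warehouse do the joining.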
This is very similar to what John had to do. He had to understand his product, understand his destination and where he was going. He shouldn't have to delete or move his product around too many different places. He was validating his product, checking that it was good enough quality to give to his final consumer. He was separating out his ETL: he was buying the goods, transporting them, and selling them somewhere else, and he probably had infrastructure in place for all of those things. And he was "joining at the database level", so to speak: maybe giving his goods to a wholesaler or a supermarket, where they could be combined with other goods.

The final thing I want to leave you with: you should think of data as a product. Just as you go into a supermarket and buy something, or go into a library and search for something, you should be able to find where that data is without having to investigate it and look inside it to see what it actually contains. That's it. My company is Deep Data Intelligence, we're online, and that's my LinkedIn profile. These are the references I used; John Beer was actually a real person in 1431. [Music] [Applause]

Q: Thank you for that, Gatis, a very interesting talk. If the requirement isn't clear, or the customer isn't sure what data they would need in the first place, is there any part of the pipeline we can work on anyway? Sometimes we don't know what the requirement will be; what can we do, without having access to the data, just to prepare for that?

A: I would say that if the customer doesn't know what data they need before they consume it, it's very hard. To use an analogy: if they walk into a supermarket and don't know what they're going to buy, how are they going to find it? It's really about the question they have. If they think about the question they're trying to solve, they can then look for the data that could answer it. There's a large variety of datasets on the internet, and data vendors that can give you their data, so going to the customer and understanding a bit more about what they're trying to solve before looking for the data would be the better option, if that makes sense. Any other questions?

Q: If you've read Seeing Like a State by James C. Scott, which you may have, it goes into a lot of the fascinating history of the fight to impose standardised weights and measures, and it turns out that a lot of the opposition was simply to avoid things like uniform taxation across the country, so people were actually arbitraging and misreporting and all that kind of thing. Is there any parallel to that in data engineering?

A: I think so, yes. If you make a process easier, if you automate it, there will be fewer people doing those jobs, fewer data engineers doing the work. So yes, the more standardisation there is, the fewer of us are needed. Frankly, there was a very good talk earlier that said this is something we shouldn't still be doing; what we should be doing is going to the moon. I feel like I'm a postman for the internet: I'm just getting data from one place to another. It should be a trivial thing; I want to do more interesting things than this. This is just how you get to an insight.

Q: Very interesting talk. I was wondering about the schema you showed. I often find myself as a stakeholder, because as data scientists we are often the consumers of this data, we have to query it, and I end up having to investigate exactly this and do the validation, or suggest the validation. Everything you've said makes a lot of sense; my question is about the very end, when the main data goes to the reading database. Do you envisage an additional validation stage where you can check that the query itself gives what you would expect?

A: Normally, when I'm building data engineering products, I do that at that stage: I'll take a union of the two datasets and compare them, so that if I run a query it gives me the same result that you should be getting out at the end.

Q: And finally, just as a joke: do you have any recommendations for English people in terms of adopting European units of measurement?

A: Next question.

Q: I wanted to ask about the monitoring step. What would you recommend as best practice, if you start from scratch: monitor after each and every step, or monitor at the end after everything is done?

A: Airflow is a good way to manage your datasets. You can also use Jenkins, which is good in my opinion; you can have monitoring there. In Jenkins you can create Jenkinsfiles with certain blocks of code, and it can tell you which block you got to when the code failed and why it failed. If you're writing your code in Python, you have exceptions: create your own custom exceptions for different problems. I normally have, for example, a validation function that raises different custom exceptions based on the problem that occurred, so you can find out what the problem is much more easily than by investigating the data issue itself. In terms of monitoring tools, I'd say Airflow is a good one.

Q: I hear there's not much documentation on the Airflow side, but I just wanted to get your suggestion: say, for example, you are doing an extract step.

A: You're saying, if you have an extra step?

Q: No, I'm saying, say for example you implement the extract step in the pipeline: do you apply monitoring after implementing that step, or at the end of the whole pipeline development?

A: That depends on the time constraints you have for building your code, but you should build it throughout the process, as you're writing your extraction functions. In fact, what you should be doing is standardising the way that you extract data. What I do, what I've taught my data engineers and what I've implemented, is this: for all these different data sources I have created Python code that extracts data from databases, FTP sites, APIs and S3 in a standardised format. You give it some configuration, and it will spin up the monitoring and the validation for you during that process, so you don't have to code the same thing a thousand times.
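The speaker doesn't show that code, but a configuration-driven extraction framework along those lines might look roughly like the sketch below; every name in it (SourceConfig, run_extract, the extractor registry) is invented for illustration.

```python
# A sketch only: every name here is invented; this is not the speaker's library.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


class ExtractionError(Exception):
    """Custom exception so monitoring can report *why* an extract failed."""


class ValidationError(Exception):
    """Raised when extracted data does not look like what was expected."""


@dataclass
class SourceConfig:
    name: str                       # e.g. "orders_api"
    kind: str                       # "api", "ftp", "database", "s3", ...
    location: str                   # URL, connection string, bucket/prefix, ...
    expected_fields: Tuple[str, ...]


# One extractor per source *kind*, registered once and reused by every feed,
# instead of being re-implemented inside every pipeline.
EXTRACTORS: Dict[str, Callable[[SourceConfig], List[dict]]] = {}


def register(kind: str):
    def wrap(fn):
        EXTRACTORS[kind] = fn
        return fn
    return wrap


@register("api")
def extract_api(cfg: SourceConfig) -> List[dict]:
    # A real implementation would call cfg.location and parse the response;
    # stubbed here to keep the sketch self-contained.
    return [{"order_id": 1, "price_pence": 10}]


def run_extract(cfg: SourceConfig) -> List[dict]:
    """Framework entry point: extract, validate, and report, driven by config."""
    try:
        records = EXTRACTORS[cfg.kind](cfg)
    except Exception as exc:
        raise ExtractionError(f"{cfg.name}: extract failed") from exc

    for record in records:
        missing = [f for f in cfg.expected_fields if f not in record]
        if missing:
            raise ValidationError(f"{cfg.name}: missing fields {missing}")

    print(f"[monitor] {cfg.name}: {len(records)} records extracted OK")
    return records


# Usage: adding a feed is configuration, not a new pipeline.
orders = run_extract(SourceConfig(
    name="orders_api", kind="api",
    location="https://example.com/orders",          # hypothetical endpoint
    expected_fields=("order_id", "price_pence"),
))
```

Adding a new feed then means adding a SourceConfig entry, and at most one new extractor for a source kind the framework hasn't seen before, rather than hand-writing another pipeline with its own validation and monitoring.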
Q: Thank you. If you go back to the process map, where do you do most of the data cleaning, and by data cleaning I mean things like enforcing requirements on data types: in the extract phase, or in the transform and load phase?

A: That depends on what your data source is. If you're consuming from a database it's harder, because you're reading over a JDBC connection; if you're doing it from an API it's much easier. So you can't always do it for every data source, but you should try to do as much as possible early on. In fact, databases are quite a different kind of data source by themselves: what you should hopefully be doing is connecting straight to the database, and that sorts out part of the problem for you. But yes, sometimes you can't always validate your data. [Applause]
Info
Channel: PyData
Views: 76,688
Rating: 4.9214807 out of 5
Keywords: Python, Tutorial, Education, NumFOCUS, PyData, Opensource, download, learn, syntax, software, python 3
Id: pzfgbSfzhXg
Length: 29min 57sec (1797 seconds)
Published: Tue Apr 09 2019