Top 10 Data Engineering Mistakes

Video Statistics and Information

Captions
Now you might think I'm going to tell you what you're doing wrong. I'm not. I'm going to tell you where we have gone wrong - mistakes that we've made in the past - so that we can all learn from them, and so that you can make different mistakes and come here and tell us about them. I will say things like "do this" and "don't do that", but I'm not telling you what to do; that's just a short formulation for "these are the conclusions I have come to", so that you can perhaps come to different conclusions, or speed up your path of learning. In everything - whether it's data engineering or playing the piano - if you don't know any rules or guidelines, it will sound terrible. Once you get some things in place you can provide value, and as time goes by, in order to do something really good, you have to bend the rules. So I'm trying to convey the rules and practices that I have come to conclude; you might come to different conclusions.

Why am I standing here - what makes me suitable to do this? Well, I've been very lucky in my career in a couple of aspects. I've worked for some very good companies that are really good at these things and picked a few things up from very talented people, and for the last two and a half years I've been an independent consultant, helping a whole bunch of companies. So I've seen more big data environments than most people, and across that many environments some patterns emerge - things that people do over and over again - and that's what I'm trying to convey to you.

So how do I define a mistake? A mistake is something that prevents you from reaching your end goals. There are actually different end goals out there. My definition of a goal is some kind of profit: either money, or happy users, or something of business value. There are other goals out there, whether conscious or unconscious, and that's fine, but that's my definition for the sake of this talk. And we do notice that people make a lot of mistakes. There is a Gartner quote that maybe 60% of big data projects fail, and that turned out to be an underestimation - it's probably more like 85%. I've seen some companies fail and some companies succeed, and I'm trying to convey what I see as the difference.

This is a general blueprint. This is a data tech conference, so I assume most of you are familiar with it: you have services doing things; they emit events that you collect, or they store data in databases and you collect that data, and you put it into some kind of cold store - a raw zone - where you keep it forever, if it wasn't for GDPR. From there you build processing pipelines - Spark, Hadoop, whatever your favorite thing is - and at the end there is an egress stage where you take things out of your data lake and push them somewhere to make them useful. This picture is focused on batch; there is a streaming equivalent, and I will come to that later in the talk.

As you start out - and this is the way people started out years ago, but it still happens - you write a little batch script or something: you take the data that you collected for a particular hour and you run Spark on it, because Spark is a decent tool. Then you want to do something on a daily basis - in this example you want to compute some key performance indicators, and you want to form sessions, because you want product insights into your web shop or something. That works fine for a while.
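As a rough illustration (my sketch, not the speaker's - the job names and paths are hypothetical), this naive setup often amounts to a cron-driven script that runs the jobs in a fixed order, with no record of which day succeeded and no automatic retries:

```python
#!/usr/bin/env python3
"""Hypothetical naive daily driver, run from cron every night.

There is no dependency tracking: if one step fails, or an old date is
rerun by hand, nothing downstream notices and nothing is retried.
"""
import subprocess
import sys
from datetime import date, timedelta

day = (date.today() - timedelta(days=1)).isoformat()  # process yesterday

steps = [
    ["spark-submit", "jobs/clean_events.py", "--date", day],
    ["spark-submit", "jobs/daily_kpis.py", "--date", day],
    ["spark-submit", "jobs/sessionize.py", "--date", day],
]

for cmd in steps:
    # Each step silently assumes the previous one wrote its output for `day`.
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # Manual intervention required; if steps are rerun out of order,
        # downstream jobs may happily compute on partial data.
        sys.exit(f"step failed: {' '.join(cmd)}")
```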
Then campaign management decides that they want some more information, so you add a job to the right that depends on the stuff to the left, in order to calculate the impact of your campaigns. That also works fine for a little while, but then something goes wrong on the left and the dependent job on the right breaks. Not too bad - you can go in and rerun it manually - but you start to realize that this doesn't scale beyond a few jobs. It's even worse with other types of failures, which might very well lead to silent data corruption: much later you discover that you didn't have all the data and you have been doing the wrong calculations. This gets out of hand very quickly, and the solution is something that manages your dependencies - let's call it a workflow orchestrator.

There are a bunch of them out there, and if you only remember one thing from this talk, it is this: pick a good workflow orchestrator. That is one of the keys to success. Unfortunately there are some weak ones out there - weak in the sense that you cannot express all the things that you need - and you recognize them because they use a non-expressive language as their DSL, such as XML, or graphical interfaces. Unfortunately, many of the vendors ship the weak ones rather than the good ones. Google is the shining exception: they recently announced a managed Airflow service. Airflow is one of the two good ones; the other one is called Luigi. Both are perfectly fine tools, and if you're not using them, have a look at them. I tend to recommend Luigi to my clients. I'm a little bit biased - I used to work at Spotify, where we made Luigi - but it's much simpler: it's easier to get started with and it has a smaller scope. I like simple tools; they do one thing well, and that leads to simpler operations and so forth. But Airflow has some advantages as well, so it's a trade-off.
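To make the contrast concrete, here is a minimal, hypothetical Luigi sketch of the same daily pipeline (the task names and paths are made up). Dependencies are declared explicitly, each task's output doubles as its completion marker, and a rerun for a given date only executes whatever is missing - exactly what the chained script above lacks:

```python
import luigi


class CleanEvents(luigi.Task):
    """Cleans one day of raw events (paths are hypothetical)."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/clean/{self.date}/events.txt")

    def run(self):
        # ... read the raw events for self.date, clean them, write them out ...
        with self.output().open("w") as f:
            f.write("cleaned events placeholder\n")


class DailyKpis(luigi.Task):
    """Will not run until CleanEvents has produced its output for the date."""
    date = luigi.DateParameter()

    def requires(self):
        return CleanEvents(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"data/kpis/{self.date}/kpis.csv")

    def run(self):
        with self.input().open() as events, self.output().open("w") as out:
            out.write(f"kpis computed from {len(events.read())} bytes of input\n")


if __name__ == "__main__":
    # e.g. python pipeline.py DailyKpis --date 2018-06-15 --local-scheduler
    luigi.run()
```

If an upstream task fails, the scheduler retries it and holds back everything that depends on it, instead of letting downstream jobs run on missing data.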
All right, on to number two. Hadoop can do a whole lot of things: it can look like a SQL database, and it can even do transactions these days. So that's great - we don't need a SQL database anymore, we can ditch our old Oracle database and just use Hadoop, and we have really powerful tools that can run queries all over the data lake to bring out the data that we want, in real time, when we need it, by querying a lot of datasets. That sounds great. Well, it's not so great, because that's not what these big data technologies were meant for. Unlike relational databases, which are multi-purpose, distributed technologies sacrifice a whole lot of things in order to gain scalability, performance and redundancy. Distributed technologies are good at one thing - typically one thing only, or maybe two or three - and with everything you add, you are sacrificing quality. The Hadoop technologies were made for offline processing, not online processing, so if you take what you used to do online and just swap in the new technologies, you're actually making things worse.

This big data revolution that we're somewhere in the middle of is actually not about new cool technologies or lots of data. Yes, there is some of that as well, but that's not where I see companies getting the most value. They get the most value from new ways of working and collaborating. Instead of the peer-to-peer organizations on the left, where you have to talk to five different teams in order to get the data to do something innovative, you have an organization that pours all of the data down into one place and democratizes it internally, which means that you can innovate with data much faster - either in batch, in the data lake, or in stream storage for real-time processing. The other big benefit is working in pipelines where you save the raw data, rather than processing it first and then saving it, so that you can go back if things go wrong and you can invent new things from the raw data. This gives tremendous iteration speed and really powers data innovation, if you get it right and are able to move quickly.

Some anecdotes about that. In one company we decided that we wanted to do an important calculation, and it was important that we got it right, so we asked all the nodes that were collecting and saving data to push their data into Hadoop, and we only started the calculations once all the data was in. The problem is that it takes a while to figure out whether all the data is actually in there, and when nodes started failing this took longer and longer and got more and more fragile - the more we scaled out, the worse the system ran. This was for reporting, where we really needed the quality; but if you just want dashboards for product insight, quality is less important than getting results quickly. So there is no good answer to how long you should wait.

Another company said: OK, we're not going to wait. We're going to start processing, then monitor the late incoming data, and once it reaches one or two percent or so, we recalculate downstream. This has different disadvantages, because the downstream calculations are no longer predictable and reproducible - which is OK for some cases, but not great if you want to do machine learning or data science experiments where you tweak your algorithm and ask "did it get better?" - well, the data changed, so you don't really know.

On a tangential note, at a different company, the people doing event collection said: we have this service that does geo-IP lookups - wouldn't all the events be better if we decorated them right away with some geo information? So let's do that before we save the data. That was fine - we got even better events - until the geo service broke down, or did the wrong thing, and we now had events with missing or wrong data, and that is painful to recover from.

Anecdote number four: we had some processing where we needed information from a database. Hadoop jobs can go and look things up in databases to decorate the data, so we just went to the production database and did that. Now, if you're doing a large computation, that might take down the production database; and if you're doing a small computation, that's fine - until you need to rerun it, because then the production data has changed and your pipeline is no longer reproducible.

What these four stories have in common is that they deviate from functional principles. If you've learned functional programming you might have picked up the principles of immutability - once you've written something, never change it; idempotency - you should be able to repeat operations; and reproducibility - everything should be deterministic. Whether you do functional programming or not doesn't matter; these are the same principles applied at an architecture level, and following them makes your life a lot easier. So once you've written a dataset, never change it, and make sure that your execution is reproducible.
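As a hypothetical sketch of what those principles look like in a pipeline (my example, with made-up paths), each job reads only immutable, date-partitioned inputs and writes its result to a single date partition, so a rerun for a day overwrites exactly that day and nothing else:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_by_country").getOrCreate()

DAY = "2018-06-15"  # the run date, normally injected by the workflow orchestrator

# Immutable input: the raw partition for that day, never a live database.
orders = spark.read.json(f"datalake/raw/orders/date={DAY}/")

daily = orders.groupBy("country").agg(F.count("*").alias("orders"))

# Idempotent and reproducible: rerunning the job for the same DAY rewrites
# exactly this partition, and the inputs it read have not changed.
daily.write.mode("overwrite").parquet(
    f"datalake/derived/orders_by_country/date={DAY}/"
)
```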
It's worth sacrificing these principles for a good reason, but know when you are doing it and why you're doing it.

So, in order to avoid going to production databases from Hadoop jobs, you can dump the databases on a daily basis, so that for each day you have a copy of the database. There are tools like Spark and Sqoop that can go out to a database and dump it.

This is one of my favorite war stories. We had a Cassandra cluster serving users - a big Cassandra cluster, 40 or 50 nodes, lots of users - and we wanted that information on a daily basis. We had a Hadoop cluster, and this was a lot of data, so we had 20, then 50, then 100 Hadoop machines going out to get all of that data. That created quite a load spike on Cassandra, but we trusted it, because Cassandra is very scalable, so it could sustain the spike. The spike became half an hour long, then an hour, then two hours, and one day the spike got 25 hours long - so we had a double spike within one hour, and that took down the login service. People could not log in, and the people running the login service were not happy, so they put a network firewall between the Hadoop cluster and Cassandra. Problem solved - and we had to figure out a better way.

I wasn't paying attention, and a couple of years later somebody wanted to do the same thing, also against the login service, and they decided not to go to the database but to a nice little REST interface, where you got very clean data. Of course the same thing happened: the Hadoop cluster brought down the login service. So Hadoop can behave like a denial-of-service attack.

This can happen on the egress side as well. This is a recommendations scenario: the people doing recommendations had this great new algorithm, they had computed big indexes for the recommendations, and they wanted to push them out to test them on users. So they dumped it all into a Cassandra cluster - and Cassandra is replicated, so Cassandra happily replicated it to the other data centers, which took all of the cross-Atlantic bandwidth, bandwidth that was needed for much more important things, such as serving users.

So what we learned is that the offline environment is very powerful; it's basically an internal denial-of-service attack ready to kill your online environments at any moment, so you want to separate those environments. In the online world, things are really important: if things break you have unhappy users out there, they give crappy feedback, they shout on Twitter and things like that, so you really need to be proactive about making sure that these systems are up. The impact is high, so the quality needs to be built in up front. In the offline world you have internal people doing business insights and so forth; there are a few of them, and they are quite good at giving feedback, because they come to your desk when things go wrong. You don't need to protect these things as well, so you can take a much higher risk and do your quality assurance on a lazy basis: when things go wrong, you go in and do more quality assurance. There is a much higher return on investment on that type of quality assurance.

So what I learned is to separate the online world at the start from the offline world in the middle, and the offline world from the online world at the end, with very careful handovers in between. For dumping databases, for example, you can take a nightly backup of your database, bring up an offline copy from that backup, and dump from that one - it isn't serving any users.
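A minimal sketch of that handover, under the assumption that the nightly backup has been restored to a replica that serves no user traffic (the hostnames, credentials and table names are hypothetical, and the PostgreSQL JDBC driver is assumed to be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dump_users").getOrCreate()

DAY = "2018-06-15"

# Read from the restored backup replica, never from the production database,
# so the dump cannot create a load spike on anything that serves users.
users = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://offline-replica.internal:5432/users")
    .option("dbtable", "users")
    .option("user", "etl")
    .option("password", "...")
    .load()
)

# One immutable snapshot per day in the raw zone of the lake.
users.write.mode("overwrite").parquet(f"datalake/raw/users/date={DAY}/")
```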
Likewise, if you're pushing things like recommendation or machine-learning models out to the online world, throttle the push, and keep several versions of those recommendation models or fraud-detection models, so you can switch back to an old one if the offline world fails you. You don't need new models on a daily basis as long as you still have a few old ones around.

Ali is an old colleague of mine, and he decided on an academic career: in the middle of the big data hype, in 2007 or 2008 or something, he decided to go into academia. I was working at Google at the time and said: Ali, come and work with us, we're doing super fun things. No, no, I'm going to go to academia. Then they built this thing called Spark, and more or less by accident he ended up being CEO of one of the coolest startups ever. He says: we have clients with petabytes and petabytes of data, and they all want to do machine learning. I'm sure they have those clients; in fact there are companies that generate petabytes of data - say they have a billion users, or they have things like jet engines that produce approximately as much data as a billion users. But most companies don't have petabytes; they have gigabytes, because they have users producing a few kilobytes per day, and the number of users is limited by, you know, the number of Swedish speakers or something. So you may actually not need scale, and your data might very well fit in memory. If you go for clusters anyway, you are taking on a very large cost, because in distributed systems lots and lots of things happen that never happen in a single-machine environment, and you want to make sure that if you scale out you actually get value for paying all of those problems and costs.

So mistake number five is engineering for imaginary scaling problems. Your data is likely to fit in memory on the largest cloud instances, or at least a portion of your data is. Remember, the value of big data is not necessarily about size, and not necessarily about machine learning either, but about new ways of working and collaborating. In the most successful project I've been part of, we were using Spark, but never in clusters - only in local mode. This was a company with very little business outside of Sweden, so we knew we couldn't get more than ten million users or so; everything fits in memory, and local mode works great.

A client came to me and said: we have this issue that things are getting slower by the day - are you sure storing things in S3 is a good strategy here? They pointed me to a couple of blog posts, and that made me figure out what they were actually doing: they had a job that read one year of data, every day. There was another situation where we had a large cluster, and there was a system using a large portion of the cluster all the time, so we figured it must be doing something really important. It turned out it was reading years and years of full raw data on a daily basis - hundreds of terabytes - computing something, and then dropping it all, so the next day it would read hundreds of terabytes again.

Usually you don't need to do that; it's better to change your algorithms so that they are incremental. You process the new data and do as much as you can - typically aggregate - on a daily basis, and then, even if you want regular results over a long time period, you only do the final part of the computation over those aggregates. In this example, let's say you want a top list of the countries where your customers are, or the countries that place the most orders, and you want it daily over a rolling one-year window. If you do it the naive way, you read all of the raw data, join it with the user database, and produce the top list every day. If you do it incrementally, then each day you count the orders per country - you join with the user database and do the heavy computation day by day - and at the end, to form the top list, you only aggregate the small daily counts. The raw data is a lot of data; the daily counts are a small amount of data; the problem compresses.
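Continuing the hypothetical daily aggregate sketched earlier, the rolling one-year top list only needs to read the small per-day aggregates, not a year of raw data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("country_toplist").getOrCreate()

DAY = "2018-06-15"

# Small inputs: one tiny orders-by-country file per day, produced incrementally
# by the daily job; partition discovery exposes the `date` column.
daily = spark.read.parquet("datalake/derived/orders_by_country/")

toplist = (
    daily.where(F.col("date") > "2017-06-15")  # rolling one-year window
    .groupBy("country")
    .agg(F.sum("orders").alias("orders"))
    .orderBy(F.desc("orders"))
    .limit(10)
)

toplist.write.mode("overwrite").parquet(f"datalake/egress/country_toplist/date={DAY}/")
```

The naive version would instead re-read the full year of raw orders and re-join it with the user database every single day.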
OK, splendid. The data lake has a real-time variant as well; it's called the unified log. Instead of putting all of those things in batches of files in your lake, you put them in a stream that everybody can consume - this is a data tech conference, you're very familiar with it - and you can build similar pipelines with stream processing that spits out data to new streams, doing the same thing as in batch, but in real time. Surely this must be much better than batch? You don't have to wait until the next day to get results, some things you think need to be quick, the demos look much better, and you've heard some cool companies say that this is what the cool kids are doing now.

Well, this comes with a cost, and that cost is called operations. If you change things in your streaming environment, you need to be much more careful. If you change your schema, for example, you need to do it carefully, because if you spin out new data and your downstream pipelines or downstream jobs get confused by it, you will have an operational problem. If you do the same thing in batch, the jobs will crash as well, but batch is very forgiving, because there are things like the workflow orchestrator that will retry your failed jobs, so once you fix the problem you recover again. There are ways to deal with this in streaming, but the scenario is more difficult. And if you have a bug and you spit out invalid data, your events will spread through all of your pipelines, and instead of having ten batches of bad data you have ten million bad events, and you have to reach out, figure out which ones are bad, and weed them out yourself. This is a much more difficult scenario to recover from, and there are no tools to really help you here.

Compare this with batch, and take the operational scenario where you have a faulty job that produces some faulty recommendation indexes or something. What you do is: you realize the indexes are bad, so you revert to the old ones - because you were saving a few copies of the old ones, right? Then you fix your bug, and then you just remove the faulty datasets, of which there are approximately ten. You can look in your workflow orchestrator, because it keeps a dependency tree, so you know which ones they are - you could build tooling for this, but people haven't, because it's fairly easy to do by hand. And then you're done; there is nothing more to do operationally, because the workflow orchestrator will see that some datasets are missing and will backfill them. That's how it works for me, at least, with Luigi; with Airflow it's a bit more complicated, but it's still not difficult.
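As a hypothetical illustration of that recovery, reusing the made-up Luigi tasks from the earlier sketch: delete the faulty date partitions and ask the scheduler to build them again, and only the missing outputs are recomputed:

```python
import shutil
from datetime import date, timedelta

import luigi

from pipeline import DailyKpis  # the hypothetical task module sketched earlier

# The roughly ten days produced by the buggy code, identified from the
# orchestrator's dependency tree.
bad_days = [date(2018, 6, 5) + timedelta(days=i) for i in range(10)]

for day in bad_days:
    # Removing the output is enough: the task is no longer "complete".
    shutil.rmtree(f"data/kpis/{day}", ignore_errors=True)

# Rerunning the pipeline backfills exactly the missing partitions.
luigi.build([DailyKpis(date=d) for d in bad_days], local_scheduler=True)
```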
This means that you can recover from programming errors in about thirty minutes, even though they happened in production, and this speed gives immense power to your data innovation - but you only get it with batch. There is a Confluent blog post on how to recover from a similar scenario with Kafka Streams. It's long. It's actually a really good blog post - they describe it very well, they do fantastic work - but it doesn't cover the pipeline scenario: it only works if you have a single job, not a full chain of jobs. So by all means go for stream processing, but make sure that you are prepared to pay the price in terms of operations. Make sure that you have a use case for it, or that you have a case where data quality is not super important because you're only feeding dashboards or something - that's fine. The good news is that this gap is decreasing, because the tooling is getting better - Confluent is doing a fantastic job of pushing this out to the world, and I see some things coming out of Google as well - so hopefully in five years or so I won't be standing here saying this.

All right. Sometimes the data looks wrong, and you discover it only after a while. I was in a team where we really cared about data quality, really cared about getting things right, and we were taking lots and lots of data from the rest of the company - our manager called us the sewer of the company, because everything that people upstream did wrong or sloppily, we had to deal with. At one point we noticed: wait a minute, this content type is supposed to be text or audio or video or something - and one day "bullshit" showed up as the content type in some of the messages. It turned out somebody had pushed an experiment to production upstream, and they never noticed. We, of course, noticed. Unless you are proactive about monitoring quality like this, you will find problems weeks, months, years later, and then they are much more expensive to deal with.

There are four quality dimensions to care about. Timeliness is producing results on time; that one is not so difficult, because if you don't produce things on time, people complain. Correctness is about getting the computations right; testing can help you here, so that one is fairly straightforward. Completeness is more difficult: making sure that all the data you wanted to compute on - all the things that happened during the month - is actually in this month's data. Likewise consistency: if you have two datasets that are supposed to contain the same things, like two reports pointing at the same thing, do they really have the same input data? The latter two you need to monitor for, and there are no off-the-shelf tools that do it for you. So the mistake is writing your pipelines and then being happy. You need to monitor the contents, and one good tool for that is counters - also called aggregators or accumulators. Whenever you write code and you see "this item here is supposed to match a user in the database", ask yourself: what if it doesn't? And if it doesn't, bump a counter. Don't do anything smart - just bump the counter, so that you can monitor whether things match your expectations or not. That's how we found out about the bullshit content type incident. You need to go a bit further than that and actually measure consistency between datasets, and between the records in those datasets, and so forth; likewise, there are no tools to help you there.
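A minimal sketch of that habit with a Spark accumulator (my example; the datasets and field names are hypothetical). The job doesn't fix or drop anything when a record fails to match - it just counts, and the count is what you export and alert on:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quality_counters").getOrCreate()
sc = spark.sparkContext

# Counter for events that do not match any known user.
unmatched_users = sc.accumulator(0)

# Small user snapshot collected to the driver; broadcast or join instead
# if the snapshot is large.
known_users = set(
    row.user_id
    for row in spark.read.parquet("datalake/raw/users/date=2018-06-15/").collect()
)

events = spark.read.parquet("datalake/raw/events/date=2018-06-15/")


def check(row):
    # Don't do anything smart here; just bump the counter.
    if row.user_id not in known_users:
        unmatched_users.add(1)
    return row


events.rdd.map(check).count()  # force evaluation so the counter is populated

# Ship this number to your monitoring system and alert when it drifts.
print(f"events with unknown user_id: {unmatched_users.value}")
```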
Counters are also one reason why I recommend people stay away from SQL processing for really important pipelines: SQL makes you forget about these things, because there are no natural places to put these counters, or to compensate for their absence.

All right: things break. Sometimes you do something wrong; sometimes somebody else does something and doesn't inform you. So things break, and the effect is that you get locked in - you become careful: "Oh, I'm not sure, I won't change this, because it might break something downstream." We had a situation where the old pipeline was no good, so we built a new pipeline, and that was really quick - you can build things and innovate quickly. Unfortunately, changing and removing old things takes a lot of time, so we had the old one running for another eighteen months or so, and that old one was a big burden to us. The problem is that the changes you want to make are often in different teams, and it's difficult to move across team boundaries, so what you end up with is an inability to move quickly. We measured this in one company: we concluded that if we collected a new field at the client, the average time until it was used at the other end was about one month - and this was a good company, very agile, with independent teams and so forth. I've also seen teams that don't have the communication channels, and then there is no upper limit to how long it takes to propagate new data from upstream. And I've seen environments where this is tightly coordinated technically, and then you can move on the order of days. This significantly impacts your ability to innovate at high speed. The only remedy I am aware of is to test end to end, so that you have a safe testing environment and know when you break things; then you can move fast. But this often goes against the company culture - if you have autonomous teams, it can be difficult to do.

All right, you've heard about the functional principles: you go for immutability, and in order to be reproducible you dump the user database everywhere, so you have thousands of copies of the user database and so forth. This is all great until somebody taps you on the shoulder and says: what about privacy? What about erasing users? Okay, wait a minute. The most expensive engineering mistake that I've seen was made in this area: it turns out that it takes a lot of effort to wash petabytes of data, so you don't want to be in a situation where you haven't planned for this.

Likewise, if you allow your teams to choose anything arbitrarily - autonomous teams that are supposed to be able to deploy really quickly without coordinating with other people - you end up with variance in things that don't really matter, such as time formats. I think I counted somewhere around 25 different time formats in one large data lake. Some of them were really crazy, but we couldn't change them, because we were lacking the internal testing to do it safely. I also had a situation where I was writing a pipeline, trying out Beam, and I was happily coding along - and when the job was done it turned out that the input data was actually Parquet, and I had never looked, because I assumed it was like everything else in there. Yes, I could have looked, but this needless variance is something that adds friction to your innovation speed. So there are some things that you are better off planning for early.
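A small, hypothetical example of the kind of convention that is cheap to agree on early and expensive to retrofit: normalize every event timestamp to one agreed representation - say ISO 8601 in UTC - at ingestion, before anything lands in the lake:

```python
from datetime import datetime, timezone


def normalize_timestamp(raw: str) -> str:
    """Coerce a handful of known legacy formats to ISO 8601 in UTC.

    The legacy formats are hypothetical; the point is that the lake
    only ever sees one representation.
    """
    formats = [
        "%Y-%m-%dT%H:%M:%S%z",  # already ISO 8601 with an offset
        "%Y-%m-%d %H:%M:%S",    # naive timestamp, assumed to be UTC
        "%d/%m/%Y %H:%M",       # old client format
    ]
    for fmt in formats:
        try:
            dt = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc).isoformat()
    raise ValueError(f"unrecognized timestamp format: {raw!r}")


print(normalize_timestamp("16/06/2018 14:30"))  # -> 2018-06-16T14:30:00+00:00
```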
I'm sorry for throwing an enterprise buzzword at you here: governance. My definition of governance is the things that you should have done proactively, which would have prevented you from ending up in a painful situation later - and you can put almost anything under that heading. There are a couple of types of painful situation you can end up in. The ones to the right are risk situations: you discover that you are not compliant, or that you have insufficient security, or whatever - so governance there is a form of risk mitigation. The ones to the left are all about your innovation speed, your ability to innovate with data quickly. We've been through development speed quite a bit, but I want to point out this one: I've seen a whole bunch of teams that say "data is valuable - great, I have a lot of data, it's my data", and they don't want to share it; they don't want to put it in the lake or in the stream. It turns out that a bit of governance can loosen that up. If you can say: you're not putting it in a lake where everybody can do anything with it - there are rules here in the lake, and one of those rules can be that people need to come and talk to you before they use your data - then limiting the degrees of freedom makes people comfortable sharing. So rules and reduced degrees of freedom can actually increase your speed.

If you look back, there are a couple of themes here. One, up in the right corner, is gravitating towards complexity rather than protecting and working towards business value. The other is gravitating towards interesting or new technology rather than towards the principles of big data: the principles of pipelines, data sharing, data democratization and so forth. And I find myself saying the same things to clients over and over again: keep things simple; scope down as much as you can; focus on the value - what is the value of this pipeline or this technology that we're building? For each of these choices you can have hype or profit, but you generally cannot have both, because the efforts required to go for the hyped things and the efforts required to go for the profitable things are completely different. You might notice that none of the things I'm saying has anything to do with technology.

Here are some links. The upper three are similar talks, but less technical - more about building teams and organizations and so forth. If you found some of the things I said here confusing, or you want to know more about the context, I suggest you go through the reading list, and the last one is a list of my older presentations from other conferences. Questions?

Thank you very much. Does anybody have a question? Then I have one: from your experience, which of the ten mistakes causes the biggest problems?

Well, I think that picking up the new patterns - the pipelines and the collaboration - is where you have the most to gain if you come from a legacy environment, and I think that's one of the reasons we see so many enterprise projects fail: they just bring in new technology, and then they just make things worse. Likewise this one, the workflow orchestrator: yes, it's new technology, but it's super simple, it really saves your sanity, and it makes you think in terms of pipelines and in terms of delivering value. The internal agility - that's a mistake everybody makes; I've only seen a few companies actually be good at it, and they are very coordinated, they have a very specific company culture, so you can thrive without it.
Which mistake is the most costly? That depends on your environment. If you're in an enterprise environment, this one is bound to be costly, because you will set out to do a big project - I've seen a bunch of data lake projects go on for years before they start producing value, once they've finally picked the important things up, so you can spend a lot of money there. But if you're a startup, that's not where you start. So it depends on your context.

Purely hypothetically: imagine you've got multiple data sources, one of them is using one of those random time sources and it's an hour out, so your data is coming out wrong. How would you go about debugging that - going backwards from bad results to bad incoming data?

Yes, I've done that on a number of occasions, in situations where we cared about the results. There are no shortcuts: once you end up with bad results, you just have to debug step by step - but from there you can improve. This was in a company with a strong tradition of learning from mistakes and doing post-mortems, and these kinds of issues always resulted in post-mortems, because we cared about those particular datasets. We would do the five whys and see what went wrong, why it went wrong, and how we could prevent it from going wrong in the future, and we would include all of the teams affected, all the way out to the data collection and the client: you must improve here in order for us to get the reports right. This was also a company with a very transparent culture, so the results of the post-mortems were shown to everybody. In order to have that kind of culture, you must have a culture without fear: you must be allowed to be vulnerable, and nobody must ever get punished for doing something wrong and being open about it. Unfortunately, changing culture takes a long time.

Any more questions? Lars, thank you very much for coming here. Thank you. [Applause]
Info
Channel: Coding Tech
Views: 11,847
Rating: 4.8200693 out of 5
Keywords: data science, data engineering, big data, data processing
Id: Hyhwem1Gyjo
Length: 39min 5sec (2345 seconds)
Published: Sat Jun 16 2018