The Highs and Lows of Building an Adtech Data Pipeline | TripleLift

Captions
Super quick: who am I? I'm Dan Goldin. I've been at TripleLift for five years; I joined when we were 12 people and we're about 200 now. We run a company that does real-time bidding, and I'll get into exactly what that is. The reason I mention the years is that I was pretty instrumental in writing many of the past versions of our pipeline that are currently being fixed by others to correct all the mistakes I made, which we'll cover in detail. My goal with this presentation is to walk through the narrative of the different steps our pipeline took, its evolution, because oftentimes you see the end result and you don't really understand or appreciate the motivations to get there, or the missteps that occurred along the way. So I'm trying to be a bit more transparent, with the hope that others at least see the trade-offs we had to think about and the technology choices we made at different points in time. And at the end, if anyone has questions, please ask. The agenda is pretty simple. I'll spend maybe five to ten minutes explaining ad tech, because it really sets the context for the rest of the presentation: if you don't understand the challenges of ad tech, what the data looks like, how much data there is, the rest of the presentation doesn't make a ton of sense. Then I'll talk about the evolution of the various data pipelines we've had, covering five distinct versions. Keep in mind I'm going to make it seem very black and white; that was definitely not the case. There was overlap from one version to the next, because that's what really happens when you're building big data systems. The last bit is a few lessons learned. There will be lessons sprinkled throughout from each particular version of our
pipeline, but I'm going to wrap up with three universal, or holistic, lessons we picked up along the way that apply to everything, or at least sum up what we learned. So, pretty simple. Talking about data in ad tech: if you do a simple Google search for "ad tech big data" you get millions of results, so there's obviously something there. The two tend to be synonymous; you can't really call yourself an ad tech company if you're not also working with big data. The question is why. A very simple example: if you load up the New York Times home page with a Chrome extension called Ghostery, it will show you all the analytics and tracker code running on the page. Just by loading the New York Times home page, you see 39 companies' code being executed. They aren't the New York Times; they're really just vendors collecting stats like who the user is, how long they spent on the page, and so on. At some level that begs the question: why are all these companies collecting this data? What's so special about it? The very quick, simple answer: auctions. Real-time bidding auctions. I'll do a very quick overview of exactly what that means. If you're familiar with ad tech, this won't be new; consider it a refresher. If you're not familiar with ad tech, prepare to have your mind blown. Every time you land on a web page that's running advertising, the browser sends an ad request to an exchange (it doesn't really matter what the names are, ignore them). That ad request goes to a server and contains information about you, the user: what state you're in, what time it is, what browser you're using. The exchange then fans it out to companies called DSPs, which have proprietary bidding algorithms that say: well, it's Dan, he's looking
at the New York Times using this browser, how about we serve up an ad for a Toyota Camry? They decide how much the impression is worth to them, submit a bid upstream back to the exchange, and the exchange runs a second-price auction to determine the winner, in order to serve an ad back to the browser to be rendered. The user hopefully sees the ad, clicks it, engages, whatever the case may be. If you go to a website and look at the number of ads on it, one of these auctions is happening for every single ad, so there's a ton of data being collected throughout the process: the ad requests being sent out, the bidding behavior, and all the data people collect around engagement. There's a huge pool of data being collected by just about every company across the industry. What's even more interesting is that all of this literally happens faster than the blink of an eye: on average about 200 milliseconds from the time the request is made to the exchange to the ad being rendered. So there's obviously a strong need to do things at scale, very quickly. That's a very quick overview of how real-time bidding works, and it sets up the foundation for talking about the data side of the business. It's useful to think about what data is actually collected, or generated, at each step of this auction. In the case of an ad request, it's before any buyer has seen the information or done any evaluation, so at that point all you really know is what the user's browser sends over: the IP address, which people use to determine geolocation; the user agent, which is used to determine the browser, the OS, the device; and a bit more information around user IDs, which, if you've seen the user before, may tell you they've already seen these ads.
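The second-price mechanic described above, including the publisher rules that can reject the highest bid (the Budweiser example coming up), can be sketched roughly like this. This is a toy illustration, not how any exchange actually implements it; the `Bid` fields and category names are made up:

```python
# Toy second-price auction with publisher business rules.
from dataclasses import dataclass

@dataclass
class Bid:
    buyer: str
    price: float          # CPM the DSP is willing to pay
    category: str         # e.g. "auto", "alcohol"

def run_auction(bids, blocked_categories=frozenset()):
    """Highest eligible bid wins, but pays the second-highest eligible price."""
    eligible = sorted(
        (b for b in bids if b.category not in blocked_categories),
        key=lambda b: b.price,
        reverse=True,
    )
    if not eligible:
        return None, 0.0
    winner = eligible[0]
    # Second price: the next-highest eligible bid (or the winner's own bid if alone).
    clearing = eligible[1].price if len(eligible) > 1 else winner.price
    return winner, clearing
```

So if the top bid is an alcohol ad and the publisher blocks alcohol, the win drops to the next highest bid, and the clearing price drops with it.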
So, generally, that's the information about the particular ad request. The next step is the bid request that gets sent to the DSPs. These are generally similar, but there are some subtleties: different companies have different protocols, and the information is slightly modified depending on who you're talking to. You generally want to collect this information because you want to be able to dig in and identify issues, so it's one of those things that just adds more volume. Lastly, once you get the bid responses back, what do you know? You collect information about the brand and the price; you collect information about the ad itself. Going back to our example: is it an ad for a Toyota, is it an ad for Samsung? You're really getting information about the brand and how much they're willing to pay. Then, once the exchange runs the auction, it determines the winner, which is not always going to be the highest price: if a particular website says "I don't want any alcohol ads on my site" and a buyer wants to serve an ad for Budweiser, that bid gets rejected and the win goes to the next highest. There's a variety of business rules the exchange applies before sending the response back to the browser. And lastly, as a bonus, you also start collecting engagement metrics, which can only happen in the browser: did the user click, was the ad in view, and if it's a video, how much of it did they watch? All this information is being collected for every single, or nearly every single, event the industry sees. This next bit is a bit of a stretch, but it helps me think about the problem. If people aren't familiar with the Drake Equation, it's this idea of how to estimate the number of alien civilizations that could communicate with us, and if you look at the equation, it's really just a set of factors multiplied together: what is the rate
of new civilizations arising, how long are they active for, and so on. The parallel for me is that it's a bunch of different factors you multiply together to come up with some number, and it's really this idea of blowing up your cardinality and dimension space. The numbers I'm going to share are specific to TripleLift, and you can consider us a small-to-mid-size ad tech company. Companies like Google and Facebook are in advertising at scales that are orders of magnitude bigger than what we do, but we're also not the smallest company, so take this as an example of what a roughly 200-person ad tech company experiences. So, high cardinality: what does that mean? If we go back to the previous slide where we discussed the data being collected, you see there's a browser, an OS, a region, a buyer. We have roughly 50 of these unique dimensions for a particular ad request, and you can imagine, if you're doing data analysis, that's a lot of data to slice and analyze. Couple that with the fact that there's a lot of volume: we're doing four billion ad requests every day. Take those four billion ad requests with 50 dimensions, where some of the dimensions are themselves high cardinality: there are 50 states, and a lot more regions if you're international; maybe 15 browsers, depending how specific you want to get; and if you're looking at campaigns and brands, there are hundreds of thousands of brands. Each of these, combined with the fact that there are four billion of these events happening every day, seriously adds up. On top of that, ad tech is fundamentally an extremely low-value business:
each individual ad request is worth fractions of a penny. And we don't always win the auction, so what really happens is you often get an ad request, you submit a bid, and if you don't win, you're still paying for the processing, the storage, everything you had to do for that individual event; and if you do win, it's still only worth fractions of a penny. So the ultimate question, when we're talking about a data pipeline, is how do we make it successful and accessible to our users while at the same time being cost effective and able to handle the scale. I'll dig into the evolution of our thought process, how we approached it, and the technologies we used. Here are some numbers; I'm not going to read them all, but if you're familiar with the various bits of technology, some of them will stand out. Probably the most interesting one, as I mentioned, is the four billion ad requests a day, which really means we're doing a hundred and forty billion bid requests, because we get an ad request and fan it out to a bunch of different companies, and that multiplies the effect. All told, we're collecting about three hundred thousand events a second; some of the events we sample because the load is too high, others we collect every single one. I'm sure this is being recorded, so you can review the numbers after; I just want to frame the size of the problem here. It's pretty big. Now, on to the actual evolution of the pipeline. Before I dive in, I want to explain how we think about a data pipeline, because this is the framework we'll use to look at our evolution. It's pretty similar to the ETL analogy, but in the case of
ad tech, at least, I view it as: there are events being generated, and there are sample technologies that fall within each group. There's Kafka for storing the events; there's the processing, because when you realize you're getting four billion events a day, you need to do something to transform them (the T in ETL); then you store them, and then you make them accessible to your users. For each of the following versions of the pipeline, we'll use this framing of event collection, processing, storage, and access to see how our system evolved. So, V0. It's purposely version zero because it doesn't really deserve a real version number, and the slide has "highlights" in quotes because I wouldn't really call them highlights. The key idea is where we started; think of this as five years ago (I'll give a timeline as well, to put things in perspective). This is about as simple as a pipeline can get. We didn't want to collect every one of these events; we'd seen it was too expensive. What we did instead, in the browser, was have a little function that says: if Math.random() is less than 0.1, send an event. That's sampling one event out of ten. We sent those events to a little Node application on the back end, which would do its aggregation in memory and write those aggregations to MySQL every few minutes, and then we had another job that ran daily and rolled those sampled mini-aggregates up to a daily level. It's that simple, and each of these four steps maps exactly to the framing we just discussed. So why did we do this? It was extremely simple, it leveraged all the existing technology we had, and it was not a lot of work.
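A rough sketch of that V0 flow. The real version was a browser snippet (the `Math.random() < 0.1` check) plus a Node application; this Python version is purely illustrative, and all the names in it are made up:

```python
import random
from collections import Counter

SAMPLE_RATE = 0.1  # keep roughly 1 in 10 events, like the Math.random() < 0.1 check

def maybe_send(event, sink):
    """Client side: forward roughly 10% of events to the back end."""
    if random.random() < SAMPLE_RATE:
        sink.append(event)

class MiniAggregator:
    """Server side: aggregate in memory, flush to storage every few minutes.
    Counts are scaled up by 1/SAMPLE_RATE to estimate true totals."""
    def __init__(self):
        self.counts = Counter()

    def ingest(self, event):
        # Key by a coarse dimension tuple, e.g. (event_type, region).
        self.counts[(event["type"], event["region"])] += 1

    def flush(self):
        # In V0 this was an INSERT into MySQL; a daily job rolled these up further.
        rows = {k: round(v / SAMPLE_RATE) for k, v in self.counts.items()}
        self.counts.clear()
        return rows
```

The scaling step in `flush` is also why sampled pipelines hurt during investigations: every flushed count is an estimate, and rare events may never be sampled at all.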
And it really gave us what we needed at the time. At that point we were a team of a dozen people; we weren't going to invest in all this big data infrastructure, because there's just so much overhead, both in setting it up and in the cost of running it. So what challenges did we run into? The obvious one: whenever we had an issue, we were investigating off of sampled data. Imagine trying to do any analysis where you're missing 90% of your events. We also had a ton of event loss, because if there was any hiccup in MySQL, that insert failed and we weren't able to collect that data. And it wasn't scalable: the whole thing relied on one MySQL cluster and one Node application, and when we tried scaling it up, it fundamentally just didn't work as we grew larger. At the same time, one key lesson is that we got pretty far with a very simple system. There's a lot out there right now, especially if you look at the sheer number of tools, but the reality is you can get pretty far if you're willing to make some compromises, and it's only when you realize you have a problem that you have to think about taking it to the next level. There's a little example of an error message we'd receive in 2015 when an insert failed: you'd get these emailed to you, and they usually came in batches, because if you had one failure, you'd probably have a couple hundred within the span of a few minutes. We'd get this giant email, someone would have to copy the query out of it, paste it into a MySQL client, and run it, and that would fix our data pipeline. That's how simple things were back in the day. So, the next version. This is really the idea of moving away from sampling and starting to collect every single event, really building what I
would say is the start of a modern data infrastructure. What we did here: one, capture every single event. We started using Kafka, which is really the industry standard for collecting log-level data; it's either Kafka or, if you're on the cloud, something like AWS Kinesis, but generally I don't think anyone uses much else, because Kafka has the scale, redundancy, and distribution to make things work. We also started using Redshift, with a few hacks in between; I'm not sure I have time to dig into each one, but if you come to office hours, I'll definitely share. Generally: the data goes into Kafka; we use an application called Secor, which was open-sourced by Pinterest, and all it does is read from Kafka in a variety of protocols (JSON, protobuf, or Thrift; in our case we used protobuf), convert that data to another output format (we used Parquet), and upload it to S3. Then you can do whatever you want with it from S3. What we did in this version is hack the hell out of Secor: after it uploaded to S3, it would also publish a message to RabbitMQ saying "the ad request data has been uploaded" or "the click data has been uploaded", and we had a separate process that read from that S3 location and loaded the data into Redshift. Then we had a whole separate process that would take that log-level data in Redshift and use it to run aggregation jobs, generating agg tables that lived in Redshift once again. So Redshift served as the store for our log-level events, the store for our agg tables, and the actual processing engine; Redshift became the bread and butter of what we did.
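A simplified sketch of the shape of that load-and-aggregate step. The table names, S3 paths, and IAM role are made up, and real Redshift COPY invocations carry more options; this just shows the two SQL statements the loader process issued:

```python
def copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Redshift COPY: bulk-load Parquet files from S3 into a log-level table."""
    return (
        f"COPY {table} "
        f"FROM 's3://{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET;"
    )

def hourly_agg_statement(src: str, dest: str) -> str:
    """Aggregation job: roll log-level rows up into an agg table, also in Redshift."""
    return (
        f"INSERT INTO {dest} (hour, region, browser, requests) "
        f"SELECT date_trunc('hour', event_time), region, browser, count(*) "
        f"FROM {src} GROUP BY 1, 2, 3;"
    )
```

Every new dimension a user asked for meant another `hourly_agg_statement`-style job and another destination table, which is how the agg-table sprawl described later got started.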
For whatever reason, which we'll talk about later, we also built a home-built scheduler system. I do not recommend it; you should definitely use something that's already out there. We did introduce a few technologies here, and this is really the start of decoupling the various steps of a data pipeline, making something much more modern, much more in tune with what people normally think of as a big data pipeline. But at the same time, you also see the complexity. In the past it was very linear: four boxes, data from one goes to the next. Here you're already starting to see branching, you're already starting to see systems being decoupled, and in general, as you get larger and build a real system, you start to decouple: it makes things easier to develop, easier to test, and it lets you introduce fault tolerance into your system. In our case, if any of the agg jobs or log loads failed, because we still had the data on S3 thanks to Secor, we didn't really have to worry about it: the pipeline ended up delayed by a few minutes or hours instead of actually losing data, which is the big no-no in the industry. So why did we do this? We needed to move to log-level events; we just couldn't work with sampling anymore, and we obviously wanted the resilience and scalability. What challenges did we run into? A big one was that when we moved to Redshift, we realized that every single piece of work had to go through SQL. SQL gets you pretty far, but we couldn't do anything statistical in nature, and we found ourselves bending over backwards with nested queries and window functions to do things that would have been much simpler in code; but because we were on Redshift,
we really had to use SQL for it. The next two are pretty interesting. One: what ended up happening is our users, both internal and external, kept saying "we need this dimension exposed": we need a table that has domain, we need a table that has browser, we need a table that has region. You end up with aggregate tables per specific use case, and by the end of it we had maybe two or three dozen tables designed for different use cases customers wanted. While it was nice from a performance standpoint, it got pretty annoying; fundamentally we were over it, and we kept wondering: is there a way to put everything in one table that's cheap and performant and still gives everyone the ability to slice and dice the way they want? Job scheduling is also always a problem. An example of a scheduling issue we ran into: once we had too much data, we switched our daily jobs to hourly. But what happens when one hour fails? We rerun that hour, but that hour is linked to the day, and there's another agg based on that agg, and we had to wire all of this up in our home-built system. It was not a fun problem, and I'd say we spent more time than we should have on the simple stuff of wiring jobs together. And lastly, the reason we ended up moving away over time: Redshift just got prohibitively expensive. We initially stored maybe two weeks of log-level data, then a week, then three days, then two days, just because we couldn't store it all while still being price-conscious. Another big lesson: we forked Secor way back when and never touched it since, so we really didn't get to
take advantage of any of the newer solutions or performance improvements that came with it, because we were in "let's not break anything" mode and kept it the way it was. So I'd suggest not forking open source code, or if you do fork, contributing back to the main repository. The next version is a continuation: this is where we ripped out Redshift for the processing and replaced that part with Spark, which is really a way to write distributed computational code. If you're not familiar with it: generally you write code in Scala or Python (there's an R interface as well), Spark runs it across a cluster, and it's built on this idea of resilient distributed datasets, where if any one node goes down, the data is replicated to another node, and it executes a series of map and reduce operations to do the actual data processing. The nice thing is it speaks SQL, so it was actually very easy to port our code from Redshift to Spark, because Spark has the ability to run SQL. We also replaced some of our Redshift aggs, not all of them but the bulk, with a product called Druid. Druid is a large-scale analytics datastore created by a company called Metamarkets, which has since been acquired by Snap. The way I think about it: it's not based on SQL at all; it's really designed around optimizing one single table. It backs its data up on S3 and stores it in memory in a columnar format, and its key advantage is that it's designed for interactive analytics. It's not necessarily designed to give you perfect results; it's designed to give you 99th-percentile-accurate results much more quickly. So it's a question of: do you want accuracy? Use Redshift. Do you want something quick and easy for slicing and dicing?
Use Druid. We ended up moving maybe a dozen of our Redshift aggs to Druid, and we really had one giant table: where I spoke earlier about having around 50 dimensions, we really just loaded a single file, like a CSV, with 50 columns for dimensions and 30 columns for metrics, and let Druid handle all the indexing and optimization. The UIs around it are actually pretty cool: there's a company called Imply that makes a really nice tool for slicing and dicing data, and there's an open source tool called Superset that lets you drag and drop dimensions and immediately see the result. If you've ever used Looker, it's like Looker without having to hit the run button; it updates immediately as you play around with the tool. It's incredibly powerful. Following the same structure as the version before: what was the problem? Redshift just became expensive; it wasn't cost-effective for us to use anymore, and we wanted to move away from SQL-only processing. What challenges did we run into? By this point, migrating data from one system to another is not straightforward. Anyone who's worked on regular code knows the drill: deploy a new version of your code, everything's great; it's bad, roll it back, deploy another version. You can't really do that with data. If you have three years of data and decide to load it into another database, that's going to cost a lot of time or a lot of money, probably both. That was a big challenge, and we ended up taking a compromise position: rather than load every dimension we ever had, we zeroed out or nulled some of the dimensions just to get the older data in, so the top-level metrics made sense, but without the low-level ability to slice and dice.
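That "one giant table" idea maps to Druid's rollup at ingestion: rows sharing the same dimension values are pre-aggregated into a single row with summed metrics. A toy approximation (real Druid does this in its ingestion spec, not in user code; the dimension and metric names here are invented):

```python
from collections import defaultdict

def rollup(rows, dimensions, metrics):
    """Pre-aggregate rows: one output row per unique dimension tuple,
    with each metric summed. A rough stand-in for Druid-style rollup."""
    acc = defaultdict(lambda: [0] * len(metrics))
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        for i, m in enumerate(metrics):
            acc[key][i] += row[m]
    return {k: tuple(v) for k, v in acc.items()}

events = [
    {"region": "NY", "browser": "chrome", "requests": 1, "clicks": 0},
    {"region": "NY", "browser": "chrome", "requests": 1, "clicks": 1},
    {"region": "CA", "browser": "safari", "requests": 1, "clicks": 0},
]
table = rollup(events, ["region", "browser"], ["requests", "clicks"])
# table[("NY", "chrome")] == (2, 1)
```

With 50 dimension columns and 30 metric columns, the store only ever keeps one row per observed dimension combination per time bucket, which is what makes a single wide table tractable.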
The other thing is that at this point the team was pretty small, two or three engineers, and now, on top of the Java systems and Kafka, on top of knowing the Redshift world, it was time to learn Druid, time to learn Spark. These were all new technologies our team had to learn, and while you can get pretty far reading documentation and trying things out, it's a real concern: over time you end up introducing more and more complexity to your stack, both from a knowledge standpoint and from a maintenance standpoint. Lastly, the major lessons I took from this step. One: S3 is extremely versatile. Being able to just dump data onto S3 and then have whatever system you want process it, whether that's Spark or loading it from S3 into Redshift, gives you a very nice, cheap data store with the optionality to build whatever tool you want on top. The speaker before talked about Snowflake: you can load data from S3 into Snowflake too, so now you have a completely new database based on the same log-level data. I recommend using S3 to store as much data as you can; it's generally much cheaper than anything else, and it's cheap enough to even store multiple versions of the same data in different formats just to improve your query and computation time. And lastly, I'm not trying to be hyperbolic here: SQL really is wonderful. It's existed since the 60s and 70s and we still use it. To migrate from Redshift jobs to Spark, we literally copied and pasted the queries, changed the table names, and it worked perfectly fine. It's incredibly powerful to be able to do that.
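That copy-paste portability is easy to demonstrate: the same aggregate query text runs unchanged against different engines, and only the connection (and table names) differ. Here it runs against Python's stdlib sqlite3 purely for illustration; on Redshift or Spark SQL the query string would be identical:

```python
import sqlite3

# The exact query text we could hand to Redshift, Spark SQL, or here, SQLite.
AGG_QUERY = """
    SELECT region, COUNT(*) AS requests
    FROM ad_requests
    GROUP BY region
    ORDER BY region
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_requests (region TEXT, browser TEXT)")
conn.executemany(
    "INSERT INTO ad_requests VALUES (?, ?)",
    [("CA", "safari"), ("NY", "chrome"), ("NY", "firefox")],
)
rows = conn.execute(AGG_QUERY).fetchall()
# rows == [("CA", 1), ("NY", 2)]
```

Engine-specific functions (like Redshift's date functions) are where the copy-paste breaks down, but plain aggregates like this one move between engines untouched.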
This is almost the last version. The idea here is that we introduced the lambda architecture into our system. If people aren't familiar with the lambda architecture, it's the idea of having a real-time system and a batch system that run side by side and write to the same datastore. The real-time system runs seconds or minutes behind; it doesn't do a perfect job in terms of accuracy, but it gets you directionally correct data much more quickly. Then, three or four hours later, when the batch system catches up, you overwrite that data with the more accurate data the batch system generated. Druid, which we'd been using, supports both: it let us load batch data from Parquet and CSV files, and it also let us read directly from Kafka. So we kept the batch system we had from before, and we introduced VoltDB, which let us read data from Kafka, do joins in memory, and publish the results to another Kafka topic, which Druid then read into the same datastore. None of our API or BI tools really had to change, which is actually pretty incredible: we introduced a new technology, didn't have to change the interfaces at all, just made a configuration change in Druid, and suddenly we had data delayed by maybe a minute, even less in some cases. Just being able to do that is pretty powerful. So, VoltDB. This is my blurb on how I think about it, and I'm sure it's more complicated than this: it's able to run extremely high-performance loads, and it's meant to be a transactional database, like MySQL, but much more performant, because you define, ahead of time, all the logic it's going to run.
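Going back to the lambda idea for a second, the overwrite semantics (batch replaces the real-time answer for a time bucket once it lands) can be sketched as follows. The hour keys and counts are made up:

```python
def serve(hour, batch_segments, realtime_segments):
    """Lambda read path: prefer the batch-built segment for an hour once it
    exists; otherwise fall back to the approximate real-time segment."""
    if hour in batch_segments:
        return batch_segments[hour]      # accurate, arrives hours later
    return realtime_segments.get(hour)   # directionally correct, seconds behind

batch = {"2019-06-01T00": {"requests": 1_000_000}}
realtime = {
    "2019-06-01T00": {"requests": 998_321},   # superseded once batch landed
    "2019-06-01T01": {"requests": 412_007},   # batch hasn't caught up yet
}
# serve("2019-06-01T00", ...) returns the batch numbers;
# serve("2019-06-01T01", ...) returns the real-time numbers.
```

The appeal is that readers never have to know which path produced a number; the datastore (Druid, in our case) handles the substitution, which is why none of our BI tools had to change.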
VoltDB compiles that logic ahead of time, and the way it stores data is optimized for the queries you're going to run. A concrete example in our world: we have an event called an auction, with an auction ID, and a click event later on that also has an auction ID, and we really just want to join them by auction ID to determine whether a given auction had a click, yes or no. The way VoltDB works, you tell it ahead of time: fundamentally, we're going to be joining on auction ID. When it loads the data, it shards on that key, so the same auction ID, no matter the event type, ends up on the same partition, and when you run the actual queries, it executes them at the partition level. It never has to shuffle data between partitions or between nodes; it does everything in these vertical buckets, and that's how we were getting the high-speed performance. Once again, why did we do this? We really just wanted (needed is maybe too strong a word) faster data, because waiting three hours took a very long time, especially when we did deploys and wanted to see how things worked, or debugged newer campaigns, or launched a new publisher and wanted faster feedback. We did a little proof of concept, it worked well, and we stuck with it. What challenges did we run into? Yet another technology, as we go further down the big data road. We also spent a lot of time on versioning and upgrades: we upgraded Kafka from 0.8 to 0.10 before doing this, but then VoltDB didn't have the appropriate connectors, Druid didn't have the appropriate connectors, and there were subtleties in the release notes saying performance was going to suffer when you upgraded. Sure enough, performance suffered.
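The partition-by-key trick described above can be approximated in plain Python: hash every event to a partition by auction ID, then answer the "did this auction get a click?" question entirely within each partition, with no cross-partition shuffle. All field names are illustrative:

```python
from collections import defaultdict

NUM_PARTITIONS = 4

def route(events):
    """Shard events by auction_id so matching events always co-locate."""
    parts = [defaultdict(list) for _ in range(NUM_PARTITIONS)]
    for e in events:
        p = hash(e["auction_id"]) % NUM_PARTITIONS
        parts[p][e["auction_id"]].append(e)
    return parts

def had_click(parts):
    """Partition-local join: each partition is processed independently,
    and no event ever needs to move between partitions."""
    result = {}
    for part in parts:
        for auction_id, evs in part.items():
            if any(e["type"] == "auction" for e in evs):
                result[auction_id] = any(e["type"] == "click" for e in evs)
    return result

events = [
    {"type": "auction", "auction_id": "a1"},
    {"type": "auction", "auction_id": "a2"},
    {"type": "click",   "auction_id": "a1"},
]
# had_click(route(events)) == {"a1": True, "a2": False}
```

Declaring the join key up front is the whole trade: you give up ad hoc joins on other keys in exchange for never paying a shuffle.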
found out in production, because our staging environment didn't have the production data volumes. Long story short, this caused more trouble than it was worth. A separate lesson here: test on production-scale data. You do all these tests in staging on dummy datasets, then you get to production and things are very different, and all sorts of problems appear. And lastly, the lambda architecture itself is very elegant; this idea of having batch and real-time and combining them is pretty awesome.

So then this is the current version, which I call "less is more." I know I don't have a ton of time here, but what we really did is replace VoltDB, because it was expensive and it was yet another language, and we realized we didn't really need the full sub-second latency. We went with Spark mini-batches: we tried Spark Streaming, ran into some problems, and decided that Spark batching was okay, at least running every 10 minutes instead of waiting three hours. Everything else really stayed the same.

We had scaling issues with VoltDB, so we moved back to batch; I'll talk a little more about that later, but it's hard to manage at scale. The way I think about it: you can almost always get performance if you want it, you just have to pay a lot more for it. So when you say "we couldn't do this, we ran into scale issues," more often than not it's really a money issue: you just didn't want to pay for it. Part of the trade-off was that instead of having results within a minute, we suddenly had to wait about 30 minutes. We also spent a ton of time unifying dimensions, and we finally introduced Airflow, a real scheduling system, instead of our home-built one. And the other thing we've realized as we've gotten bigger is that being all things to all people is not a scalable strategy. You give a user data, and they basically say, "I deserve this data for the rest of my life," which makes it
very hard for you, as a data engineering team, to take data away from them. That's the challenge we have now: we're trying to remove data and actually optimize for performance, and we can't, because so many people are hooked on the data.

I'm running out of time, so this is a very quick slide just to give some perspective. This was five versions over roughly five years, each one lasting maybe a year to a year and a half. As I said, this is a simplification, because there's a ton of overlap between one version and the next; even now we still have some aggregate data in Redshift, because it just hasn't been a priority to move it off.

Then, three key lessons learned. One: this slide is a squirrel eating a giant nut, and the idea is, don't bite off more than you can chew. The lesson is that there is a ton of tech out there. Just by being at this conference you realize there's Snowflake, there are the tools from the earlier talks, there's Kafka, Spark Streaming, Kafka Streams, and as much as you as an engineer want to try every single one, you have to keep in mind that you ideally want to grow your team at the same rate you grow the complexity of your stack. One engineer cannot manage a Druid cluster, a Kafka cluster, and a VoltDB cluster; it just doesn't make any sense. You really have to think about the technology you use alongside the resources and people you have. Part of this is that using open source is great, and using vendors is great: they offload some of the work. VoltDB helped us with some of it; something like Snowflake does some of that work too. But there's a price you're paying, and that's something to keep in mind.

Two: I'm not going to play these videos; I don't think I have permission anyway, so I'm not sharing them publicly. One was me recording myself scrolling through the configuration documentation for Kafka, and the other was scrolling through the documentation for Druid. Each is a 30-second clip, and
it's basically insane how much you can configure in Kafka. You get up and running, everything works great, then you hit a high enough volume and you realize, holy crap, all the configuration we did three years ago no longer makes any sense, and no one has the knowledge to redo it. We've run into this again and again: the DevOps and configuration work for big data systems is just extraordinarily difficult. Part of the reason is that someone on the DevOps team may not have familiarity with Druid, and someone on the data team may not have familiarity with operations, so it becomes this no-man's land of "how do we optimize, tweak, and tune our systems?"

And lastly, I hinted at this earlier, but unless you're the size of Facebook or Google, a really huge company, you probably aren't limited by technology. In our case, we were really limited by money: we just couldn't afford to run all the systems we wanted, and that's why we had to come up with these simplifications. I suspect that if your company is similar-sized, your challenge isn't going to be technology; it's going to be making things cost-efficient. And that's it, so thank you.

Q: What drove the design choice of Spark as your batch processing execution engine, as opposed to something like Redshift Spectrum? Have you looked into that?

A: Yeah, it's a great question. The simple answer is that when we moved to Spark, Redshift Spectrum didn't exist, so it's very simple. And at some level we had already committed to Spark. I still think it was the right choice, because our data science team is also using Spark right now, so it's nice that there's this shared experience and shared knowledge of the Spark system that both teams benefit from. It also decouples you from Amazon somewhat. We do use Athena, for example, in some cases, for some of the queries where we want to query our
logs directly without really investing in the full data pipeline. After we upload the data to S3 using Secor, we have another job in Airflow that indexes it and loads it into Athena, and then against Athena we really just have a few simple jobs and queries, instead of having to go through a whole separate Spark system. So I think your use case is really interesting, and we haven't explored that option much because we were already on Spark.
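The lambda-architecture idea from the talk — a real-time path that writes directionally correct numbers right away, overwritten hours later by an authoritative batch pass into the same store — can be sketched roughly like this. All names here are illustrative; this is not TripleLift's or Druid's actual code:

```python
# Minimal sketch of the lambda architecture: a real-time path writes
# approximate results immediately; a batch path later overwrites the same
# keys with authoritative numbers. Names are hypothetical.

class SegmentStore:
    """Stands in for a Druid-like datastore keyed by (hour, metric)."""

    def __init__(self):
        self.data = {}

    def write_realtime(self, key, value):
        # Fast path: only fills in keys the batch job hasn't finalized yet,
        # so stale real-time writes can't clobber authoritative data.
        if self.data.get(key, (None, None))[1] != "batch":
            self.data[key] = (value, "realtime")

    def write_batch(self, key, value):
        # Slow path: always wins, replacing any approximate value.
        self.data[key] = (value, "batch")

    def read(self, key):
        return self.data.get(key)


store = SegmentStore()
key = ("2018-12-30T10", "clicks")
store.write_realtime(key, 97)   # directionally correct, seconds later
store.write_batch(key, 100)     # authoritative, hours later
store.write_realtime(key, 98)   # ignored: batch already finalized this key
```

Druid handles this segment-replacement logic internally; the sketch only shows why the API and BI tools never need to care which path produced a given number.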
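The VoltDB-style co-partitioned join the talk describes — auction and click events sharded by auction ID so the join never crosses partition boundaries — can be sketched as follows. This is a toy model of the idea, not VoltDB's real API:

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4


def partition_of(auction_id):
    # Deterministic hash: the same auction_id always lands on the same
    # partition, regardless of event type.
    return zlib.crc32(auction_id.encode()) % NUM_PARTITIONS


def load(events):
    # Route each event to its partition at ingest time, like the sharding
    # VoltDB does once you declare the partitioning column up front.
    partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]
    for ev in events:
        partitions[partition_of(ev["auction_id"])][ev["auction_id"]].append(ev)
    return partitions


def had_click(partitions):
    # The join runs independently inside each partition: no data is ever
    # shuffled between partitions or nodes.
    result = {}
    for part in partitions:
        for auction_id, evs in part.items():
            types = {ev["type"] for ev in evs}
            if "auction" in types:
                result[auction_id] = "click" in types
    return result


events = [
    {"auction_id": "a1", "type": "auction"},
    {"auction_id": "a1", "type": "click"},
    {"auction_id": "a2", "type": "auction"},
]
joined = had_click(load(events))  # a1 had a click, a2 did not
```

The design point is that the shuffle — normally the expensive step of a distributed join — is paid once at load time instead of on every query.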
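The move from sub-second VoltDB results to Spark mini-batches every 10 minutes boils down to bucketing events into fixed windows and processing each window as one small batch job. A minimal sketch of the windowing, with made-up timestamps and payloads:

```python
from datetime import datetime


def window_start(ts):
    # Floor a timestamp to the start of its 10-minute mini-batch window.
    return ts.replace(minute=(ts.minute // 10) * 10, second=0, microsecond=0)


def bucket(events):
    # events: iterable of (timestamp, payload) pairs.
    # Returns {window_start: [payloads]} so each window can be handed to a
    # small batch job on its own schedule.
    out = {}
    for ts, payload in events:
        out.setdefault(window_start(ts), []).append(payload)
    return out


events = [
    (datetime(2018, 12, 30, 10, 3), "impression-1"),
    (datetime(2018, 12, 30, 10, 9), "impression-2"),
    (datetime(2018, 12, 30, 10, 14), "click-1"),
]
windows = bucket(events)  # two windows: 10:00 and 10:10
```

In the real pipeline the per-window work is a Spark job scheduled by Airflow; the sketch only shows the latency trade-off — nothing is visible until its window closes, but the window is 10 minutes rather than 3 hours.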
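The Athena step from the Q&A — Secor uploads logs to S3, then an Airflow job "indexes" them so Athena can query them — typically amounts to registering new partitions against an external table. A hedged sketch of generating that DDL; the table name, bucket, and path layout are hypothetical, not TripleLift's real naming:

```python
def add_partition_ddl(table, dt, bucket="example-logs-bucket", prefix="events"):
    # Builds the Athena/Hive DDL that registers one date partition pointing
    # at the S3 prefix the uploader wrote to. All names are made up.
    location = "s3://{}/{}/dt={}/".format(bucket, prefix, dt)
    return (
        "ALTER TABLE {} ADD IF NOT EXISTS "
        "PARTITION (dt='{}') LOCATION '{}'".format(table, dt, location)
    )


ddl = add_partition_ddl("auction_events", "2018-12-30")
```

An Airflow task would submit a statement like this through the Athena API after each upload completes, after which the new day's logs are queryable without any Spark job in the path.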
Info
Channel: Data Council
Views: 2,929
Rating: 4.9111109 out of 5
Keywords: digital advertising, ad server, ad tech, data pipeline, building a data pipeline, Kafka, Redshift, Secor, Spark, Spark Streaming, VoltDB, Druid
Id: Y7VNk73qGRU
Length: 34min 51sec (2091 seconds)
Published: Sun Dec 30 2018