ETL Is Dead, Long Live Streams: real-time streams w/ Apache Kafka

Video Statistics and Information

Captions
Hi everyone, I'm super excited to speak at the keynote session at QCon. I keep telling people, QCon is one of my favorite conferences to be at; I feel pretty much at home here. This talk is about a big change that I've experienced companies going through in my day-to-day job. This change is about how companies manage their data, the resulting ETL technology that they use, and how the rise of real-time data and stream processing is driving that change. In this talk I'm going to go through how things work today, what the drawbacks are, what the new shiny future for ETL looks like, and what role Kafka plays in that picture. Before I start, can I get a quick show of hands of people who've either heard of or used Kafka in any form or manner? Okay, about 90%, I'm super excited. All right, let's get started.

Data and data systems, I've observed, have really changed in the past decade. If you think about how things worked roughly a decade ago, data really resided in two popular locations: the operational databases and the warehouse. Most of your reporting ran on the warehouse about once a day, sometimes several times a day, so data didn't really need to move between these two locations any faster than several times a day. This in turn influenced the architecture of the technology used to move data between places, called ETL, and also the process of integrating data between sources and destinations, which broadly came to be known as data integration.

But now several recent data trends are driving a really dramatic change in the ETL architecture. First off, single-server databases are rapidly being replaced or augmented by a whole bunch of distributed data platforms that operate at a company-wide scale, whether that's NoSQL systems like Cassandra, MongoDB, or Elastic, other systems, SaaS applications, and so on, so your ETL tools have to handle more than just databases and the warehouse. Second, there are many more types of data sources that companies are interested in collecting beyond the transactional ones; these include logs, sensors, and metrics, so now your ETL tools need to go beyond handling just the relational data model and be able to support a pretty large variety of data models. And last but not least, stream data is becoming increasingly ubiquitous; there is a need for processing data quickly as it arrives instead of in batches.

Now these trends are playing out, but the technology hasn't caught up. If you're wondering what data pipelines in companies actually look like, this is what I've seen in practice: there are applications that talk to each other using some kind of enterprise messaging queues, and there are custom ETL scripts written to move data between sources and destinations. This ad hoc manner of connecting sources and destinations in a one-off fashion as they arrive is pretty chaotic; it is unmanageable; it is lossy. In this talk I want to explore how transitioning to streams cleans up this mess and works towards a much more scalable, manageable ETL architecture. This is the idea of having a streaming platform that serves as your central nervous system: it allows applications to talk to each other, it allows you to stream changelogs from databases and make them available to other systems, and more interestingly, it allows stream processing applications to thrive that can transform this data in a much more incremental fashion.
Now before I take a look at the solution, let me take a quick look at a short history of how these tools evolved. Data integration first surfaced in the 1990s, when retail organizations wanted to analyze buying trends. The way they did that was by extracting data from operational databases, transforming it into a single global schema that matches the warehouse, and then loading the data into the warehouse. By the early 2000s a lot of companies and industries had followed this trend, and a new class of technology emerged to do this data movement between operational databases and the warehouse, called extract, transform, and load, which is ETL. In its simplest form it just means copying data between different locations.

Now ETL tools have been around for a while, but data coverage in warehouses is still pretty low. What I've seen is that out of thousands of operational databases, maybe a couple hundred are really available in the warehouse, and you might think why. Well, it's because traditional ETL tools have drawbacks that show up when you try to use them in practice. Here are a couple of those drawbacks. One is the need for a global schema. Data modeling is a really hard problem in its own right, but modeling one global schema for a very large domain is even harder, and this is really something that limits the plausible scope of either warehouses or ETL tools. The second is that the T of ETL really stands for data cleansing, which means transforming data into its cleanest form and defining what it means, and that process is error-prone and manual. And as you can imagine from the previous two drawbacks, the operational cost of ETL is really high: it is slow, it is time and resource intensive. Most importantly, ETL tools were built for a niche problem: they were narrowly focused on connecting databases and the warehouse, and on doing that in a batch fashion.

Because ETL was only focused on that problem, a new class of technology emerged to connect applications in real time, and that came to be known as enterprise application integration. It is just a bunch of tools built to facilitate the exchange of business transactions and messages between applications. Unsurprisingly, they used enterprise service buses or enterprise messaging queues underneath the covers, and the problem was that these message queues worked for small-scale data; they were just not designed to handle the scale of data required for modern data sets like logs and sensors, and so we can't really use them for any kind of company-wide, large-scale, real-time data integration.

The upshot is that the class of technology available for solving this whole data integration problem, whether it's ETL or EAI, is really outdated. On one hand you have enterprise application integration, which is real-time but not scalable; on the other hand you have ETL, which is scalable but batch. So organizations face a tough choice: when you adopt one of these, you get either real-time or scale, but you have to pick one. Now these trends are pretty big; they demand not small changes to these tools but a complete revamp. That complete revamp is required to create a technology that is actually suitable for a modern streaming world, where real time and scale are not the exception, they're pretty much the norm. And so, as a result, this modern streaming world actually has a new set of requirements for data integration.
First, you have to have the ability to process not just high-volume but high-diversity data. Second, it needs to be real-time from the ground up, and you might wonder what that entails. One part is that we need the technology to support it, but the other part, which is equally important, is that we need to make a pretty fundamental transition to what I'm going to call event-centric thinking.

Let me try to explain that a little more with the help of this example. Let's assume this magical streaming platform can support high-volume, high-diversity data, and do that in real time. Now, this is an example of a retail web app: it logs product page view events, and we want to analyze them in Hadoop. So on one side we stream those events into the streaming platform, and on the other side Hadoop subscribes to the streaming platform and loads that data. This works, but at this point it may not be clear what problem the streaming platform is solving. If you go one step further, you realize that over time you've added more ways of producing these product events: you launch a mobile app, you create an external API. When that happens, you will notice that in this picture the Hadoop side of the load doesn't need to change at all. This decoupling introduced by the pub/sub model in the streaming platform allows things that produce data and things that consume data to evolve on their own, because they don't have to know about each other. Going one step further, as you add more downstream applications that need to consume the same product events but process them differently, you will notice that you don't have to add point-to-point connections with everything that produces the same data; you merely subscribe to the central streaming platform. So the end result, and what I've observed, is that event-centric thinking, when applied at a company-wide scale, is actually the core reason for ending up with a much cleaner streaming platform and a much cleaner ETL architecture.
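As an illustration of that decoupling (this sketch is mine, not from the talk), here is roughly what the producing side might look like with Kafka's Java producer client. The topic name product-page-views, the broker address, and the JSON payload are assumptions made up for the example; whether the event comes from the web app, the mobile app, or the external API, each producer publishes to the same topic without knowing anything about its consumers.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The producer just appends the event to the shared topic; it knows nothing about
        // Hadoop or any other subscriber that loads the data downstream.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String pageView = "{\"userId\":\"u42\",\"productId\":\"p7\",\"source\":\"web\"}";
            producer.send(new ProducerRecord<>("product-page-views", "p7", pageView));
        } // close() flushes any buffered records before returning
    }
}
```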
That brings me to the third requirement: forward compatibility. This event-centric thinking has another advantage: it allows you to create a forward-compatible data architecture. What I mean by that is the ability to add more applications that might need to process the same data differently. If you look at the previous example, every new application was added to solve a business need that was felt at that time; there was no way to predict it ahead of time. When we added the retail web app, we didn't know we might create a mobile app in the future; when we had just Hadoop, we didn't know we might have to add other ways of processing that same data differently. And this is really important: your ETL architecture needs to allow new data sources and new data systems to emerge over time and use the same data differently.

To explore this requirement a little more: enabling forward compatibility, that is, multiple destinations for the same data, has an important implication for what the T stands for in ETL, which is traditionally data cleansing. We need to move from data cleansing to data transformation. Let me explain that with an example. Let's say this is the same example: we have the logs, and we need to move them to the warehouse. Traditionally you might extract them as unstructured text; your T actually stands for data cleansing, which is really defining what this product view means and all its different fields; and then your load involves loading it into a specific system, the data warehouse. And here you might need to drop some personally identifiable information fields from this data to really make it usable for its users, so that custom transformation runs on your warehouse. Now what happens if you add another destination for this data, let's say Cassandra? You will notice that you repeat the business logic for extracting that data, you repeat the transformation that cleanses the data and turns it into a real product view, you load it into Cassandra, and then you run the same custom transformation again. This is pretty wasteful: in order to allow more destinations, we repeat the business logic of cleansing the data, which is not only inefficient but can also be lossy if one of these scripts fails.

What if instead we make clean data available up front? That is, we extract and load just clean product events into the streaming platform. The transform runs on the streaming platform to create another stream, which drops the PII fields, and now you have two choices for loading the data, into either the warehouse or Cassandra, or anything that might arrive in the future. Now, the second implication for this T: you might wonder, if clean data is available up front, does the T in ETL completely go away? Not really. It now stands for actual data transformations that make data ready for the destination system. For example, if you were to run a query like "find me the top N most popularly viewed products in a certain price segment", you now have to enrich this data, which previously only had the product events with the PII fields filtered out. When this is done on top of the streaming platform, which holds the clean product events, upgrading your transform becomes simpler: you upgrade it to a join, so you don't just drop fields but enrich the events with product metadata, which might itself be a changelog stream from another source database that is available in the streaming platform. And this is great, because your extraction and transformation are done once, while your load can be done several times into different systems; they don't all have to repeat that transformation. So to summarize, this point about forward compatibility is important: it means extracting clean data once, making it available to be transformed in several different ways, and loading it into the respective destinations as and when required.

So to summarize, what are the needs of a modern streaming data integration solution? We need scale, we need diversity, we need low latency, and more importantly, forward compatibility. I summarize these needs because they drive the requirements for the solution. We need fault tolerance and parallelism, so we can deploy lots and lots of these ETL processes to handle large data sources. We need support for low-latency delivery semantics, and ordering, which is important. We need operations and monitoring, so you can view and monitor all your ETL copy processes centrally, and schema management, for how schemas can evolve as you copy data. These are all hard problems in their own right. Instead of solving them in a one-off way in custom ETL tools meant for specific systems, I am advocating for an approach that is much more practical and efficient: let's solve all these problems in a common platform that is reusable for many different use cases.
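The talk introduces the Streams API only later, but purely as a sketch of the "transform once on the platform" idea described above, here is roughly what a PII-dropping transformation might look like as a small Kafka Streams application. The topic names, the application id, and the stripPiiFields helper are hypothetical, and a real pipeline would work against a proper schema rather than regex string manipulation.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class DropPiiTransform {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "drop-pii-transform");   // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> productViews = builder.stream("product-page-views");

        // The cleansing transformation runs once, on the streaming platform. The warehouse,
        // Cassandra, or any future destination simply loads the derived, already-clean topic.
        productViews
                .mapValues(DropPiiTransform::stripPiiFields)
                .to("product-page-views-no-pii");

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder for the PII-dropping logic; illustrative only.
    static String stripPiiFields(String eventJson) {
        return eventJson.replaceAll("\"userId\"\\s*:\\s*\"[^\"]*\",?", "");
    }
}
```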
So now let me paint what the new and shiny future for ETL looks like. In this future all your data is represented as streams; the central streaming platform serves as a storage layer for your stream data; extract and load involve moving streams between external systems and the central streaming platform; and transformation takes the shape and form of stream processing. The streaming platform serves almost like a central nervous system for your company's data: it serves as the real-time messaging bus through which your applications exchange messages; it serves as the source-of-truth pipeline for feeding any and all data processing systems, be it Hadoop or a warehouse or NoSQL systems or several more; and it serves as the building block for stateful stream processing microservices or applications, which all represent your company's business logic as stream processing. In this future, companies still have the data integration problem; the solution just looks very different. In a streaming-first world we still do ETL, but in a streaming fashion, on top of a central platform, redefining what the T stands for, which is essentially stream processing.

So let me summarize what we've gone through so far before diving into the streaming platform: a short history of data integration, the drawbacks of ETL tools, the needs and requirements for a streaming platform, and what this new and shiny future for ETL looks like. In the latter part of this talk I want to go through what a streaming platform looks like and how it enables this streaming ETL. That journey starts with Apache Kafka, an open source distributed streaming platform. We created Kafka to make this event-centric thinking available at a company-wide scale; we had a very particular vision for what a company would look like if it reimagined its use of data around streams of events. We started Kafka roughly six years ago at LinkedIn. Today it is deployed at a pretty large scale at LinkedIn: it serves more than 1.4 trillion messages per day across several data centers. Kafka is now adopted across thousands of companies worldwide, from retail to web tech and fintech, and roughly thirty-five percent of the Fortune 500 use Kafka today.

So what I want to look into is what role Kafka plays in this new future for data integration. First off, Kafka is the de facto storage of choice for stream data. Most of you are familiar with Kafka; some of you might also be familiar with the log abstraction. This is what the storage backend of Kafka is based on: a persistent, replicated, write-ahead, append-only log, where every record is identified by a unique index called an offset. Writes are only in the form of appends; readers use that offset to index into the log and then read messages in order. Now, the key insight at the heart of Kafka is that this abstraction is a great primitive for building scalable pub/sub messaging. Your publishers are the ones that append data to the end of the log, and your subscribers, the Kafka consumers, maintain their own offsets; they index into the log and scan sequentially from there onwards. The key point is that the sequential nature of reads and writes allows this abstraction to support pretty impressive throughput, on the order of hundreds of thousands of messages per second. So Kafka is a storage backend, but it also offers a scalable messaging backbone for application integration. This is what we all know Kafka for: one of the core APIs of Kafka is the messaging API, which allows you to produce and consume messages, so applications embed these client libraries and talk to each other using Kafka.
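To make the offset model concrete, here is a minimal consumer-side sketch to pair with the earlier producer example (again mine, not from the talk): the subscriber's position in the log is just an offset that advances as it reads, independently of any other subscriber. The topic and group names are assumptions, and the poll(Duration) signature assumes a reasonably recent Kafka client.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PageViewConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");             // assumed broker address
        props.put("group.id", "hadoop-loader");                       // each consumer group keeps its own offsets
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("product-page-views"));
            while (true) {
                // Records come back in offset order per partition; this subscriber's position
                // advances independently of every other subscriber reading the same log.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```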
The third piece is that Kafka enables building streaming data pipelines. In the 0.9 release of Apache Kafka, the community added the Connect API. The core focus of the Connect API is to make building these streaming data pipelines from external systems into Kafka really easy, in an off-the-shelf manner. And last but not least, the way Kafka completes this picture for data integration is that it is now the basis for stream processing and transformations: the 0.10 release of Apache Kafka added the Streams API, which allows you to write stream processing operators or stream processing programs very easily on top of Kafka; you can embed the Streams API and write stream processing in the application itself. In the remainder of this talk I want to dive a little deeper into both the Connect and Streams APIs, because they really complete the vision for streaming ETL.

Let's start with Kafka's Connect API, which is really the E and the L in streaming ETL. Building streaming data pipelines using Kafka is all about these connectors. In this picture it seems simple to just draw a line and have data flow between any external system and Kafka, but there's a lot that goes on underneath the covers to do this correctly. Let's take a step back and look at the system-level view of things. That was the logical picture, but streaming ETL today might involve moving data between data centers; companies have multiple data centers for several reasons, maybe you're migrating from on-prem to a public cloud, or for disaster recovery, or mergers and acquisitions. So any problem that involves moving data between sources and destinations today also involves moving data between data centers in a streaming fashion. In Kafka's Connect API there are two abstractions, pretty easy to understand: sources that pull data from external systems into Kafka, and sinks that push data from Kafka into external systems, and some of these sources and sinks can be written in both streaming and batch fashion if needed.

This is the most interesting ETL problem, as we talked about: how do we make data from databases available to not just the warehouse but any other application that needs to use it? If you think about it for a moment, one way to do this in a streaming fashion is to set up triggers and scan the database; that works, but it is inefficient. Another way is to stream the changelog of the database. Some of you might be familiar with this: databases are designed like this underneath the covers, where they rely on a commit log as the source of truth, tables are merely views of that commit log, and database replication works, for the most part, by shipping these changelogs around. The changelog is an abstraction where every message is essentially an update, a change, or a mutation to the database, so if you were to scan this changelog from the very beginning and apply it to an empty database, you could essentially recreate the database. You will notice that this abstraction looks very similar to the log abstraction in Kafka. In fact Kafka has special support for changelogs, and these database connectors are in fact the most popular ones written on top of Kafka's Connect API.
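As a rough sketch of what the Connect API asks of a connector author (my illustration, not an actual connector), a source task mainly has to return batches of SourceRecords from poll(); the Connect framework takes care of writing them to Kafka, persisting source offsets, scaling tasks out, and recovering from failures. The topic name, the offset bookkeeping, and the fake change events below are placeholders, and a complete connector would also provide a SourceConnector class that defines its configuration and splits work into tasks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Minimal shape of a Connect source task: poll() hands records to the framework, which
// writes them to Kafka and persists the source offsets on the task's behalf.
public class ChangeEventSourceTask extends SourceTask {
    private long position = 0L;   // e.g. a changelog position in the upstream database

    @Override
    public String version() {
        return "0.1";
    }

    @Override
    public void start(Map<String, String> props) {
        // Read connector configuration here (connection URL, table whitelist, etc.).
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // A real connector would read the next batch of changes from the database changelog;
        // here we fabricate a single placeholder event per poll.
        position++;
        Map<String, ?> sourcePartition = Collections.singletonMap("source", "example-db");
        Map<String, ?> sourceOffset = Collections.singletonMap("position", position);

        List<SourceRecord> batch = new ArrayList<>();
        batch.add(new SourceRecord(sourcePartition, sourceOffset,
                "db-changelog",                        // assumed target topic
                Schema.STRING_SCHEMA,
                "{\"op\":\"update\",\"row\":" + position + "}"));
        return batch;
    }

    @Override
    public void stop() {
    }
}
```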
Making these changelogs available in Kafka actually has another cool consequence: transformations become much easier and much more scalable. Instead of applying transformations on either the source or the destination databases, they can be applied on this replicated log, which is a lot faster. So instead of moving data blindly between the source and the destination, it moves through Kafka, you can transform it, and then move it into destination applications. The core focus of Kafka's Connect API is to make writing these connectors super simple, to make them available in a really off-the-shelf manner, and to do all the hard work underneath the covers. Kafka's Connect API leverages and builds on top of Kafka's scalability and fault-tolerance model; it gives you one way of monitoring all your connectors; and most importantly it offers the option of carrying the source schema into all the destination systems. So if you add a column in your source database, something that would previously have broken data pipelines, that extra column now gets carried seamlessly throughout the data pipeline and applied to either the Elasticsearch index or the Hive table, without users even knowing about it, in a pretty transparent manner. Today there are lots and lots of these sources and sinks, all open source and available for use, so you can connect a pretty large set of sources and sinks in an off-the-shelf manner using Kafka's Connect API.

So that's Connect, the E and the L. Let's now look at the Streams API, which really stands for the T in streaming ETL. Stream processing is all about applying transformations on stream data, and transformations can take many forms: filters, maps, windows, joins, aggregations, and so on. Before I take a look at the Streams API, let me take a step back and talk about two broad visions for stream processing; I got a chance to experience both. The first vision is: let's make MapReduce go faster, let's build a real-time MapReduce layer. MapReduce used to work really well, so let's make it go faster. The other vision is very different; it says: let's look at all your business logic, all your applications, and think of those applications as event-driven stream processors. I mention these two visions because they really influence what the solution looks like. If you think of stream processing as a real-time MapReduce layer, you have a central cluster that runs a whole bunch of processes; you have to express your stream processing code as a job that is packaged in a custom manner, just like with Hadoop; it is deployed and monitored in a custom manner; and you probably have YARN or Mesos for fault tolerance, and so on. This model probably works well for long-running, analytical types of queries, where you want to run a large multi-tenant cluster. With event-driven microservices the focus is very different: it says, let's think of applications as things that take input streams, apply business logic that is really stream processing, and produce output streams. If you think of it that way, and of making that easier, then all your stream processing engine really has to be is a library that application developers can just embed and start using. So in this world you have a Kafka cluster, you have your app, and you don't deploy anything else. The main focus of this vision is really to make stream processing available as a general-purpose programming paradigm: it is not a niche thing, it is available across the company.
Here are some examples of systems that subscribe to the real-time MapReduce vision; in fact I had a chance to work on Apache Samza, which is similar to some of these systems. While putting it into practice at LinkedIn, we learned that what developers wanted was to continue developing their Java apps; what we asked them to do was to take some part of the Java app, express it as a job, and then talk to the stream processing people to get it deployed on their cluster. That created enough friction that it was hard to adopt stream processing at a LinkedIn-wide scale. So then we looked at the problem a different way: we came at it from the event-driven microservices vision and created Kafka Streams. Streams is just a library; it is an API that you embed in your application, and then you can do stream processing. The core focus we had when we created Streams was: let's make it the easiest way to do stream processing on top of Kafka. People love the producer and consumer libraries, so let's give them a stream processing library. As a result, this is what it looks like: it is a powerful but lightweight Java library; all you need is a Kafka cluster, and then the Streams API can be used to transform Kafka topics. It has a convenient DSL with all sorts of operators, still evolving, but joins, map, filter, windowed aggregates, and so on. As an example, here is the code you might write for the cliched word count example: you create a builder, you write your code to count the words, and then, in this example, the output is just another Kafka topic, though you could send it to any other external system, and then you say start.
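The word count example described above looks roughly like this with the current Streams DSL; note that the exact builder and operator names have evolved a little since the release the talk refers to, and the topic names and application id here are made up for the sketch.

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example");     // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Create a builder and write the code that counts the words...
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");            // assumed input topic
        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word)
                .count();

        // ...and in this example the output is just another Kafka topic.
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        // Then you say start.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```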
The cool thing is that you can take this code and run it on one instance, and it will run just fine; you can package it in a Docker container and deploy it on Mesos, and your code doesn't have to change. Essentially all the hard work, which is how you load-balance and assign partitions to all your different application instances, is handled transparently by the Streams library, because it builds on top of Kafka primitives. Streams is an event-at-a-time stream processing engine, so it doesn't do micro-batching; it can process events as they arrive. We've been pretty inspired by the insights that Google's Dataflow team has shared, which all revolve around how to handle late-arriving data. Without going into too many details, the core insight is that we need to accept that there will be late-arriving data when you're processing data in a streaming fashion, and we need to handle it correctly. The idea is to differentiate between event time, which is when an event takes place in the real world, and processing time, which is when it actually gets processed; if you handle these two things differently, you can get correct results even as late-arriving data shows up. This is a pretty deep topic in and of itself; I know Tyler and Frances are giving a talk on this, and I would highly recommend going and learning from those two great engineers.

The fifth point is that the Kafka Streams API does a cool thing: it has out-of-the-box support for local state. This is one of the things we learned from Samza that worked well, and we adopted it in Kafka's Streams API. It is essential, because it really allows you to build these high-throughput, fast stream processor apps. To get into a little more detail: if you think about state and how you manage it in apps, the traditional way is to stick that state in some kind of external database and be done with it. This works, and you can trust the external database, but there are some inefficiencies. One is that there is less isolation, so any app could overwhelm the database; you have fewer choices for picking different databases depending on what your app is doing; and more importantly, in order to write a stateful stream processor app, you also have to manage this external database. What Streams does is take this external database, divide it up into shards, and make it available as local embedded state. This local state could be a RocksDB engine, it could be an in-memory hash map, and there will be more options as we make more progress on Streams. The concept is that this is highly efficient, because your data is sharded the same way as your input streams, so all your processing can happen locally with the data available; you don't have to make external RPC calls, and hence it is super efficient. More than that, this local state is also fault tolerant. It's a local embedded engine, so what if your app dies, does your state go away? Not really: Kafka and the Kafka Streams API know how to do load balancing, just as Kafka load-balances your partitions, so if an app instance dies, it automatically moves that local embedded state to the remaining instances that are still alive, and all of that happens without users finding out; it self-heals automatically.

Another important point about Kafka's Streams API is that it allows reprocessing. Think about what we do with apps: we upgrade apps, we do A/B testing, and when that happens, we need to go back and reprocess either a small window of data or sometimes the entire history. Let's take an example: you deploy an app, and a day later you find a bug, so all the results processed using the last 24-hour window are perhaps incorrect. As you deploy the new version of your app, you actually want to go back and reprocess those 24 hours in order to reflect correct results. This again is one of the deeper topics in getting stream processing right, but it is one that is really important.

Now let me conclude with an example to show how these two visions can really influence what your solution looks like. In this example, you're building a real-time dashboard for security monitoring: you're monitoring user activity across the globe, you're aggregating by geo region, and in this particular app you're highlighting regions that have irregular user activity. Here's your solution if you were using vision one: you have Kafka clusters that host all these user activity streams, you have to deploy a stream processing cluster, you write your code, which is the geo aggregation code, and you deploy it as a job on that cluster; you have counts that you want to serve through the dashboard app, so now you use an external database; and finally you have your dashboard app that reads the counts and highlights them in the UI. Now contrast this with vision two: you have your Kafka cluster that hosts all those activity events, and you just have your dashboard app. It embeds the Streams library; it processes those events, counts them, and stores the counts in the local embedded state engines. The Kafka Streams API allows these local state engines to be queryable, so in your dashboard app, not only are you storing all that state, the aggregated numbers, you're also able to query it and display it on your dashboard. I think this picture probably summarizes what Kafka's Streams API is all about, which is essentially simplicity.
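Here is a rough sketch of that vision-two dashboard in code (my illustration, not from the talk): the aggregation is materialized into a named local state store, and the same application queries that store directly through the interactive-queries API instead of an external database. The topic names, the application id, the assumption that the record value is the user's geo region, and the crude start-up wait are simplifications, and the StoreQueryParameters call assumes a recent Kafka Streams client.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class ActivityDashboard {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "activity-dashboard");    // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Simplifying assumption: key = user id, value = the user's geo region.
        KStream<String, String> activity = builder.stream("user-activity");      // assumed topic
        activity.groupBy((user, geoRegion) -> geoRegion)
                .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("region-counts"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Thread.sleep(5_000);   // crude wait for the instance to come up; real code would check streams.state()

        // Interactive query: the dashboard reads the aggregate straight out of the local
        // embedded state store, with no external database in the picture.
        ReadOnlyKeyValueStore<String, Long> counts = streams.store(
                StoreQueryParameters.fromNameAndType("region-counts", QueryableStoreTypes.keyValueStore()));
        System.out.println("eu-west activity: " + counts.get("eu-west"));
    }
}
```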
What I'd like to conclude with is an observation about what batch processing really is. It is a style of processing where you take a window of data, you process it, and after you're done with that window you shut down; then you wake up at a particular future point in time, process the next window, and keep doing that. If you frame it that way, a streams app can do the same thing: it processes a window and then shuts down. Contrast that with a traditional streams app, which, when it finishes processing that window, doesn't shut down; it keeps going and processes the next window as it arrives. You'll realize that this is nothing but a different style of processing on the same abstraction; essentially, logs help unify batch and stream processing on a single usable layer.

So to conclude, the Streams and Connect APIs, along with Kafka, really encapsulate everything about what streaming ETL means and looks like. And remember that messy picture: this helps you actually clean it up and end up with a much more scalable and manageable ETL architecture. The vision that some of us have at Confluent is really to make all your data available everywhere, and immediately. If you are intrigued by some of these ideas, or if it was too much too soon, we have a tutorial on Friday if you want to dive deeper into Connect and Streams; one of my colleagues from Confluent is going to walk you through that. We also blog a lot about these ideas, so if you hit confluent.io/blog you will be able to catch up on the latest of what we are up to. And with that I'd like to conclude this talk. Thank you very much for listening. [Applause]
Info
Channel: InfoQ
Views: 243,111
Rating: 4.7455783 out of 5
Keywords: Kafka, ETL, Software Development, real-time, QCon, InfoQ, Apache Kafka, Data, Big Data, Data Analytics, Data Streaming
Id: I32hmY4diFY
Length: 39min 0sec (2340 seconds)
Published: Thu Feb 23 2017