Building data-driven applications and ML pipelines with Golang | Felix Raab

Captions
So hi, welcome to my talk, Building Data-Driven Applications with Go. My name is Felix, I'm running the engineering teams at a company called KI labs. We mostly work on software and data engineering projects, and Golang is part of our technical stacks, and today I'm going to talk about how you can use Golang to build data-intensive applications.

So why data-intensive? I didn't want to write down the overused "data is the new oil" claim, but it is a fact that we produce more and more data. You can check out the visualization called Data Never Sleeps; in its current version 7 it contains some impressive statements, such as "by 2020 there will be 40 times more bytes of data than there are stars in the observable universe", and the picture shows other interesting examples of how much data we generate every minute. So consequently we need ways in our applications to deal with larger amounts of data.

Now, a lot of big tech companies have released their internal tooling for dealing with those massive data sets as open source, and this has led to a vast big data ecosystem that is getting more and more complex and also confusing. But chances are you're not processing data at that scale in your applications, so in this talk I'd like to show you how to solve, let's say, 80% of your data problems with simpler tooling. We're not going to talk about Hadoop or running Go on Kubernetes.

Go itself has a couple of built-in language features that make it very suitable for efficiently processing data, and when we talk about processing data we often talk about data pipelines. So how could you define a data pipeline? We could define it as the process that takes input data through a series of transformation stages, producing data as output. And then in the wider data world we have a special type of data pipeline called machine learning pipelines, which we could define as the process that takes data and code as input and produces a trained ML model as output.

Let's first have a look at the typical main building blocks of a data architecture. First we need to somehow ingest data from our sources into the system, then we often store the raw data, then we process it, then we might store it again, maybe this time in some clean or optimized format, and finally we serve it, typically through an API endpoint or pub/sub. This picture more or less also maps to the well-known concept of ETL: extract, transform, load. In this talk we mostly focus on the ingest and process parts.

If you look at the data flow in a typical pipeline setup, it roughly looks like this: raw data in different formats is ingested, often through a queueing system, then it ends up in a stream processor, or it lands on HDFS, which is the Hadoop file system, or maybe an object storage such as S3. Then, depending on whether you are dealing more with streaming or batch data, after processing it goes back to the queue again, or is picked up by some batch jobs. The final destination is then often a target storage, persistent storage optimized for certain data types; this could be relational data, time series data, document or key-value stores, and so on, and from there we access it and serve it to different clients.

The setup I just explained would already involve a lot of expensive data tooling, but what is really the simplest thing you could do to set up a data pipeline that can still process a reasonable amount of data?
Here we can see an example from Martin Kleppmann's book Designing Data-Intensive Applications, which is by the way a great book; if you want to know more about the topic, you should read it. He gives this example where we connect simple Unix tools through pipes: awk reads an access log file and extracts one part of each line, we sort the values, determine the counts for the unique values, sort numerically in reverse, and take the top five items.

This looks very simple, but it's actually more than a poor man's solution. First of all, it follows the Unix philosophy: we compose small programs that each do one thing well, which gives you a lot of flexibility. And it is actually able to handle large data sets, because if the data doesn't fit into RAM it automatically spills to disk, and it can run in parallel across multiple cores. So it's actually quite powerful. On the other hand, tools like awk can also get really awkward in terms of their syntax. I don't know if you're aware that awk is actually a Turing-complete language, and I'd argue that anything more complex than the simple example I showed would involve, for most of us, a lot of googling and trial and error to achieve what you're after. Gluing these Unix tools together has often been loosely referred to as Taco Bell programming.

Which finally brings me to Go, its standard library, and built-in language concepts such as channels and goroutines, to realize what we could call advanced Taco Bell programming. We can use goroutines to offload work to multiple CPU cores and benefit from parallel processing, and we can use channels to communicate between those processing stages and construct dataflow pipelines. And in the end you can easily deploy it all as a single binary without any data infrastructure nightmares.

However, even with a nice language and runtime, things are unfortunately not so straightforward when dealing with large datasets. One of the main challenges is memory: whenever your working set, which is the data you're currently processing in memory, is bigger than the available memory on the machine, you obviously run into out-of-memory situations. This is bad, because then you need to fall back to disk I/O, which is an order of magnitude slower, and you could of course also run out of disk space, and so on. But there are a couple of techniques that help you address this challenge.

First, instead of adding RAM or spinning up an expensive big data cluster, the first technique is compressing your data — and we're not talking about zipping and unzipping it, but actual compression in terms of how data is represented in memory. A very simple example could be to store some values not as strings but as different data types, such as booleans or bits, zeros and ones. (A small sketch of this idea follows below.)

The second technique is that you can chunk data: you cut the data into logical pieces and process them one by one in parallel, so you don't load the entire data set into memory. You might know this principle from MapReduce, which is basically doing this in a distributed fashion. How does MapReduce work conceptually? Using the example of counting books on a bookshelf: you count up shelf 1, I count up shelf 2, and the more people we get, the faster this goes; then we reduce, which means we all get together and add up our individual counts. (A sketch of this chunk-and-reduce idea also follows below.)
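The slides themselves aren't included in the captions, so here is a minimal sketch of the in-memory compression idea mentioned above: packing flag-like values into bits instead of keeping them around as strings. The `Bitset` type and its methods are my own illustration, not the speaker's code.

```go
package main

import "fmt"

// Bitset packs boolean flags into uint64 words: one bit per value instead of
// a multi-byte string such as "true"/"false" per value.
type Bitset struct{ words []uint64 }

// NewBitset allocates enough 64-bit words to hold n flags.
func NewBitset(n int) *Bitset {
	return &Bitset{words: make([]uint64, (n+63)/64)}
}

// Set stores the flag for index i.
func (b *Bitset) Set(i int, v bool) {
	if v {
		b.words[i/64] |= 1 << uint(i%64)
	} else {
		b.words[i/64] &^= 1 << uint(i%64)
	}
}

// Get reads the flag for index i.
func (b *Bitset) Get(i int) bool {
	return b.words[i/64]&(1<<uint(i%64)) != 0
}

func main() {
	// One million flags fit into roughly 125 KB here, instead of many
	// megabytes of "true"/"false" strings.
	flags := NewBitset(1_000_000)
	flags.Set(42, true)
	fmt.Println(flags.Get(42), flags.Get(43)) // true false
}
```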
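And a minimal sketch of the chunking idea, in the spirit of the bookshelf analogy: split the data into chunks, let one goroutine count each chunk (the "map" step), then add up the partial counts (the "reduce" step). The helper names are assumptions of mine, not the speaker's code.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// countChunked splits data into chunks, counts matches per chunk in parallel
// (the map step) and sums up the partial counts (the reduce step).
func countChunked(data []string, match func(string) bool) int {
	workers := runtime.NumCPU()
	chunkSize := (len(data) + workers - 1) / workers

	partial := make([]int, workers)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		start := w * chunkSize
		if start >= len(data) {
			break
		}
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		wg.Add(1)
		go func(w int, chunk []string) {
			defer wg.Done()
			for _, s := range chunk {
				if match(s) {
					partial[w]++ // each worker writes only its own slot
				}
			}
		}(w, data[start:end])
	}
	wg.Wait()

	// Reduce: add up the per-chunk counts.
	total := 0
	for _, c := range partial {
		total += c
	}
	return total
}

func main() {
	shelf := []string{"go", "java", "go", "rust", "go"}
	fmt.Println(countChunked(shelf, func(s string) bool { return s == "go" })) // 3
}
```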
The third technique I'd like to mention here is indexing. This is common practice, for instance, when using log files and log rotation: you put all data for a certain time period into a logical file, and then you maintain an index, which is sort of a summary pointing to the individual files, and you only load them when you actually need them.

So now let's move to some Go code and examples that show you how to build simple dataflow systems. The first concept I'd like to mention is generators. We could define a generator as a function that returns a sequence of values through a channel; you could also refer to it as a producer-only module. Use cases for a generator could be streaming numbers, loading a file, reading from a database, scraping the web, and so on. For the purpose of these examples we just type this channel to an integer, but you could of course use any custom data structures as well.

The second abstraction is a processor, which we define as a function that takes a channel and returns a sequence of values through a channel. Use cases here are any sort of number crunching, data aggregation, deduplication, filtering, validation, and so on — everything you do in terms of transformation and processing. And finally, the third building block of our dataflow system is a consumer, which is just a function that takes a channel; in the consumer we consume the channel, which means we could print the sequence, save the data somewhere, etc.

Now, when setting up these dataflow functions, what you can do is create functions that return the type definitions, which are themselves defined as functions. You may know this pattern from setting up HTTP handlers and middleware for server applications. What is it useful for? For instance, you can further customize the inner functions while still returning the expected type, which is Generator in this case; in this example we just have parameters that we pass to the outer function and can then use in the inner function.

One example for our first pipeline is a simple number generator. In the inner function we create an output channel, and then in a goroutine we write the numbers that were passed to the outer function to this output channel, and when the loop is done we properly close the channel. The inner function returns this output channel to stick to our type definition. Next is the processor: here we do something similar, except the inner function ranges over the passed channel, then we do a simple calculation — squaring the numbers in this case — write the result to the output channel, close it when we're done, and finally return this channel. The consumer then just takes the channel and runs a forever loop: it receives a value from the channel, checks whether the channel is still open and the value could actually be received, and if yes we print it; if not, we exit.

Next we can construct this pipeline and actually run it. We call our creator functions and nest them so that the channels get passed along: we have a print consumer that accepts the channel from the square processor, which in turn accepts the channel from the number generator, and since we just square the numbers, we would simply see the squares of the numbers we passed to the number generator. (A sketch of what this could look like follows below.)
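The slide code isn't in the captions, so the following is a sketch of what the generator, processor, and consumer stages described above — and the nested pipeline wiring — could look like. The type names and signatures are assumptions based on the description, not the speaker's original code.

```go
package main

import "fmt"

// Generator returns a sequence of values through a channel (a producer-only stage).
type Generator func() <-chan int

// Processor takes a channel and returns a new channel of transformed values.
type Processor func(in <-chan int) <-chan int

// Consumer drains a channel, e.g. by printing or persisting the values.
type Consumer func(in <-chan int)

// NewNumberGenerator customises the inner function with the numbers to stream,
// similar to how HTTP handlers and middleware are often set up.
func NewNumberGenerator(nums ...int) Generator {
	return func() <-chan int {
		out := make(chan int)
		go func() {
			defer close(out) // the stage that owns the channel also closes it
			for _, n := range nums {
				out <- n
			}
		}()
		return out
	}
}

// NewSquareProcessor ranges over the input channel, squares each value and
// writes it to its own output channel.
func NewSquareProcessor() Processor {
	return func(in <-chan int) <-chan int {
		out := make(chan int)
		go func() {
			defer close(out)
			for n := range in {
				out <- n * n
			}
		}()
		return out
	}
}

// NewPrintConsumer receives values until the channel is closed.
func NewPrintConsumer() Consumer {
	return func(in <-chan int) {
		for {
			v, ok := <-in
			if !ok { // channel closed: nothing left to consume
				return
			}
			fmt.Println(v)
		}
	}
}

func main() {
	generate := NewNumberGenerator(2, 3, 4)
	square := NewSquareProcessor()
	consume := NewPrintConsumer()

	// Nest the stages so each channel is passed along: prints 4, 9, 16.
	consume(square(generate()))
}
```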
We can get a bit more advanced by implementing some of the memory techniques I showed earlier, for instance chunking and indexing. We construct a pipeline that uses fan-out and fan-in patterns to divide up the processing work and later merge the channels back together before the consumer iterates over them. In code this could look like two log files passed to one processor each, plus a new function, a log aggregator, that takes a variable number of channels and merges them into one channel. There are many options for how you can set up these generator and processor functions to compose the data flow, depending on your needs.

What I just explained is also often called multiplexing, or merging. The aggregator, as I said, takes a variable number of channels and returns a — maybe buffered — output channel. Multiplexing, we could then say, is ranging over all the input channels and internally using wait groups to close the output channel once all inner goroutines writing to that output channel are done. Part of the code could look like the sketch below: we use the WaitGroup from the sync package, and in the outer for loop we range over the channels, add to the wait group, and start a goroutine; the goroutine gets its channel passed in, ranges over it, and writes all values into a single output channel.

One important topic when linking pipeline stages together is making sure not to block channels or run into deadlocks, and there are a couple of strategies that help you address that. One is to make sure that each stage that owns a channel is also responsible for closing it. Imagine a situation where a downstream processor runs into an error and cannot receive the value from the upstream channel: this will block the sender. You could use buffers to avoid that, but it's a bit tricky to get them right in terms of what your buffer size needs to be, and so on. In any case, you should have a way of shutting down and stopping the entire pipeline, and one way to achieve that is by creating a done channel and passing it to all functions. Then you can use a select statement, which ensures that ranging over the channel continues either when the send of your processed value succeeds or when a value is received on the done channel, in which case the loop exits and the channel is finally closed.

You might also not want to explicitly close the channels at all, but rather continuously consume a stream of values and process them. When we talk about real-time streaming, what we usually actually mean is processing at very frequent update intervals, and one very simple way to achieve this in a generator is introducing an artificial delay, in this example just through a sleep call. We create the output channel and a random number generator, and then we have a loop that contains a sleep call — and what we don't have is the close statement for the channel, so it continuously streams data into the channel.

If you then have streams with high update frequencies, you might wish to rate limit them, for instance to avoid exceeding quotas or exhausting resources, and what you can do is add another processor that limits processing to a defined rate, for instance n operations per second. In this processor we create another channel called throttle, and we use a ticker to write values into the throttle channel on each tick, where the tick interval is defined by the rate you set. Later, when ranging over the input channel, you first receive a value from the throttle channel before writing to the output; this receive blocks according to the ticker rate, and this is how you can throttle the entire thing. Sketches of the merge, shutdown, and throttling patterns follow below.
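A sketch of the fan-in / log-aggregator stage described above, assuming log lines are passed around as strings; the function name and the buffer size are my own choices, not the speaker's.

```go
package main

import (
	"fmt"
	"sync"
)

// mergeLogs multiplexes a variable number of input channels into one (optionally
// buffered) output channel, using a sync.WaitGroup to close the output only
// after every inner goroutine has finished writing to it.
func mergeLogs(inputs ...<-chan string) <-chan string {
	out := make(chan string, 64) // optional buffering

	var wg sync.WaitGroup
	for _, in := range inputs {
		wg.Add(1)
		go func(in <-chan string) {
			defer wg.Done()
			for line := range in {
				out <- line // all inputs funnel into the single output channel
			}
		}(in)
	}

	// Close the output once all producer goroutines are done.
	go func() {
		wg.Wait()
		close(out)
	}()

	return out
}

func main() {
	a := make(chan string)
	b := make(chan string)
	go func() { a <- "line from log A"; close(a) }()
	go func() { b <- "line from log B"; close(b) }()

	for line := range mergeLogs(a, b) {
		fmt.Println(line)
	}
}
```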
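A sketch of the done-channel shutdown pattern described above; the stage name and types are assumptions.

```go
package main

import "fmt"

// squareWithDone keeps processing its input until either the send of a processed
// value succeeds or the done channel delivers (or is closed), in which case the
// stage exits early and closes the channel it owns.
func squareWithDone(done <-chan struct{}, in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out) // the stage that owns the channel closes it
		for n := range in {
			select {
			case out <- n * n: // normal path: downstream received the value
			case <-done: // the pipeline is being torn down: stop early
				return
			}
		}
	}()
	return out
}

func main() {
	done := make(chan struct{})
	in := make(chan int)
	go func() { // an endless upstream stage
		for i := 1; ; i++ {
			in <- i
		}
	}()

	out := squareWithDone(done, in)
	fmt.Println(<-out, <-out) // 1 4
	close(done)               // signal every stage listening on done to shut down
}
```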
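And a sketch of the streaming generator (artificial delay via a sleep call, channel never closed) together with the ticker-based throttling processor described above. The talk's version writes ticks into a separate throttle channel; here the processor receives from the ticker directly, which blocks in the same way. Names and the example rate are assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// streamNumbers writes random numbers into its output channel forever, with an
// artificial delay, and never closes the channel.
func streamNumbers(delay time.Duration) <-chan int {
	out := make(chan int)
	go func() {
		for {
			out <- rand.Intn(100)
			time.Sleep(delay) // simulate a frequent, continuous stream
		}
	}()
	return out
}

// throttle limits the stage to roughly `rate` operations per second: each value
// is only forwarded after the next tick has been received.
func throttle(in <-chan int, rate int) <-chan int {
	out := make(chan int)
	go func() {
		ticker := time.NewTicker(time.Second / time.Duration(rate))
		defer ticker.Stop()
		defer close(out)
		for v := range in {
			<-ticker.C // this receive blocks according to the ticker rate
			out <- v
		}
	}()
	return out
}

func main() {
	fast := streamNumbers(10 * time.Millisecond)
	slow := throttle(fast, 5) // at most ~5 values per second

	for i := 0; i < 10; i++ {
		fmt.Println(<-slow)
	}
}
```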
What I've presented so far may be sufficient for simple pipelines and processing needs, but your use cases might be more complex: maybe you want to consume data from your existing source systems or queues, or distribute the computation across multiple machines. For anything beyond simple pipelines there are a couple of open source projects I'd like to point out. One is called Benthos; it's a stream processor, it implements the concept of sources and sinks, and it supports actions, transformations, filters, and so on — we'll have a look at it in a moment. Another one is called Automi; it's a stream processing API that implements concepts similar to what I just presented — they call it emitter, operation, and collector — and it's basically an abstraction on top of channels, but it's at a very early stage. And then we have Bigslice, which is an ad hoc cluster processing framework; it's not strictly streaming, but you can distribute computation across multiple nodes, similar to what Spark does if you're familiar with that, except it's basically serverless and much easier to deploy and use.

One of the concepts that Benthos implements is that of sources and sinks; other terminology in use is producers and consumers, or inputs and outputs, but it's all basically the same thing. Depending on your existing infrastructure you may have heterogeneous data sources and target systems, so it would be ideal if you could connect them in a flexible way and just customize the processing logic. While you could achieve that with the techniques I showed, you may wish for a more robust solution that you can configure and that already comes with connectors for popular data components. As mentioned, Benthos is one such solution: it directly links inputs to outputs and supports the concept of acknowledgments, it has optional buffering features, it supports resilient and customizable processing logic, it has some convenience features you can add to your processors, for instance JSON queries, and in general it's well documented, configurable, and used in some production systems.

How does that work under the hood? There is a very interesting presentation from the main developer of Benthos where he explains the inner workings through simplified code snippets. You can see, for instance, that reading is not done directly from the input channel but from a queue abstraction, and writes into the channel are done through another abstraction called a transaction, which enables communication back from the sinks to the source systems via a response channel. This is how they implement acknowledgments, meaning making sure that the data was really committed to the target systems (a rough sketch of that idea follows below). Customizing the processing logic is also quite straightforward: after reading the configuration, a processor is added, and the processor is just a function implicitly conforming to their processor interface.
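Benthos's real types are more involved than this; the following is only a hypothetical Go sketch of the transaction-with-response-channel idea described above — a payload travels to the sink together with a channel on which the sink acknowledges the write — and is not Benthos's actual API.

```go
package main

import "fmt"

// transaction carries the payload together with a response channel on which the
// sink reports whether the write to the target system really succeeded.
type transaction struct {
	payload  []byte
	response chan error // nil means acknowledged; non-nil means retry or dead-letter
}

// source emits transactions and waits for each acknowledgment before it would,
// for example, commit the offset back to the upstream system.
func source(msgs [][]byte, out chan<- transaction) {
	for _, m := range msgs {
		t := transaction{payload: m, response: make(chan error, 1)}
		out <- t
		if err := <-t.response; err != nil {
			fmt.Println("not committed, would retry:", err)
			continue
		}
		fmt.Println("committed upstream:", string(m))
	}
	close(out)
}

// sink writes each payload to the target system and acknowledges on the
// transaction's response channel.
func sink(in <-chan transaction) {
	for t := range in {
		// Pretend the write to the target system succeeded.
		t.response <- nil
	}
}

func main() {
	ch := make(chan transaction)
	go sink(ch)
	source([][]byte{[]byte("event-1"), []byte("event-2")}, ch)
}
```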
Speaking of configuration, you can describe it in a YAML or JSON file and set up your inputs and outputs. Here, for instance, a Kafka broker is configured, meaning that you want to consume data from a certain Kafka topic. Processors in Benthos are then handled in a separate layer, and you can configure how many of those processors you want and what they should do. Finally you configure your targets, which could be files, a database, and so on; in this case it's just another Kafka topic where the data should end up after processing.

So, some final recommendations. Use Go, obviously, and maybe also play around a bit with some of the dataflow programming techniques that I showed you. Stay lean and check whether you really need the big data tooling, since there are both infrastructure complexities and runtime overhead involved: splitting datasets, serializing them, and sending them over the network for distributed computation is quite an expensive process. So in general, critically evaluate, and maybe run some tests and spikes beforehand. There are many stories where people wanted to process a decent amount of data and too eagerly used Spark, which then somehow took 15 minutes to process the data set, plus the ceremony of launching the cluster and tearing it down again, while directly querying a relational database would maybe have taken only a fraction of a second.

To recap: in this talk we looked at the basic building blocks of a data pipeline and some of the challenges when processing large data sets; we covered some general techniques for working with data in memory; we looked at how to construct simple dataflow systems using goroutines and channels; and finally we looked at more advanced open source tooling to set up more flexible and scalable pipelines. On my last slide I also have a couple of links that you can check out later to learn more about what I've presented. I think that's all I have, thank you.

[Applause]

Plenty of time for Q&A, so if you've got questions, raise your hand and I'll get you a mic. Okay, I see one over there.

Q: Hi, thank you for the talk. The question is: if I have several pipelines and for some reason I want to control the throughput of each pipeline, how can I do that at this low level, using Go and goroutines, channels, and so on? Let's say I have a pipeline working with a database, and for some reason I want to spawn more workers, or fewer, or something like that — how can it be configured on the fly, at runtime?

A: So one thing you can do, as I showed, is that these processing functions internally use goroutines, and you could of course spin up more of them — something I did manually in this example before merging them back, this sort of MapReduce principle. If you want to control throughput more, you could also have custom processors that limit the flow of data. And if you use something like Benthos, you can configure exactly how many processors you want to have, both on the consumer side and also when you extract from the source systems.

Q: But if I want to change this number in flight, how is it best done — some kind of HTTP handler with a callback, or maybe, I don't know, dynamically reading an updated configuration file?

A: Well, basically, I guess in production you want to do this based on some metrics that you collect and monitor. Benthos, for example, as far as I know, can be updated live — they have a management API where you can change
this configuration and it would update in real time, and then maybe you could connect that to some sort of metrics that you collect.

Q: Thanks. Thank you for the talk. One of the problems I kind of encounter with the simple approach is avoiding bottlenecks, like waiting on channels when the consumer is not consuming fast enough or the producers are too fast. What are the patterns you recommend for solving these kinds of problems? One case is rate limiting — what other kinds of patterns would you recommend?

A: So you mean in order to control the rate of data, so that you don't get bottlenecks? Well, what is typically done is to introduce buffers — even though, interestingly, the main developer of Benthos says don't use buffers at all, because most of the time you don't need them. But there might be cases where you do, and then of course if you have producers that are faster than your consumers, you can fill a buffer and consume from there. It also depends on how many CPU cores and what machines you have available, and that's how you then configure your processing logic. I think there is no single answer to that question; it's a very complex topic.

Q: Thank you for the talk. A small question: do you think it would be a more scalable approach to just use a normal queue, for example RabbitMQ, instead of channels, and then simply orchestrate the number of consumers?

A: Sure. I mean, the point I wanted to make is that whenever you use something like RabbitMQ, Kafka, and so on, it introduces complexity: it's a distributed system, many things need to be known, configured, and so on. But yes, people often do this, maybe sometimes too early — when developing an application, developers often throw in message queues too early. The point I wanted to make is that you can get quite far with simple tooling, even command-line tooling, and if you use Go you also have some nice built-in primitives for constructing these data pipelines. Inherently, of course, there is nothing wrong with using a message queue if you need it — just think about whether you really need it and whether you could also solve the problem with the built-in things you have in the language.

Questions? Five, four, three, two, one... All right, one last round of applause.

[Applause]
Info
Channel: GoDays
Views: 1,524
Keywords: #GoDays20, talk, goconference, golang, Berlin
Id: V-nSaWA0A3E
Length: 34min 22sec (2062 seconds)
Published: Fri Feb 07 2020