Data Profiling and Pipeline Processing with Spark

Captions
Good morning, everyone. It's great to be here at the Spark Summit, and I'm excited to have this opportunity to talk to you today. Before I start, I just want to do a quick audience poll: how many of you are currently in, or have been in, the trenches doing data munging, wrangling, cleansing, and all of that good stuff? Okay, so hopefully this talk is for you. I can assure you that when this journey began I had a full head of hair, but then again, correlation is not causation.

All right, who am I? I'm the Senior Director of Big Data Platforms and Frameworks at Synchronoss. I was with Razorsight, which was acquired by Synchronoss late last year. I've been in this space for a long time, and my goal is to solve real business problems with the latest technology, which, let me say, is what everybody wants to do anyway.

So, how many of you have heard of Synchronoss? Synchronoss is a publicly traded company headquartered about 45 minutes away in New Jersey, and we offer personal cloud and activation platforms for large enterprises and communications providers around the globe. What does that mean? If you use a mobile device, any mobile device, chances are Synchronoss is working behind the scenes for you, whether it's activating your device on the network, migrating content, synchronizing contacts, or setting up a personal cloud environment for you to move content back and forth, your pictures, your videos. Synchronoss's platform and software enable all of that. Our solutions help operators connect their customers, whether you're onboarding a customer in the form of a device or allowing them to synchronize data back and forth from the device to the cloud and to other devices. If you're driving a connected car, the latest cars have 4G in them, and Synchronoss activates the connected car; if you have a connected home, Synchronoss probably activates the connected home. That is the ecosystem we are in. Razorsight used to offer predictive analytics solutions to the communications vertical, so the marriage is all about applying our platform, products, and models to the solutions that Synchronoss offers. We are now part of the Synchronoss analytics group.

To give you a sample of what big data at Synchronoss looks like, here are the numbers for one large operator who has deployed the personal cloud solution: about 30 million active subscribers on the app and around 8 million daily active users, uploading anywhere from tens to hundreds of millions of pictures every day. The data size is staggering. We have deployed the solution across five data centers running on multiple clusters, so this is truly big data. All of those events coming from those devices can be used to improve the customer experience, whether it's better application functionality, crash analysis, predicting failures, or the rollout of applications and release versions.

So what does my team do? We are responsible for the big data platforms and frameworks used to generate those consistent analytics. The platform is deployed both on a private cloud and on public AWS infrastructure, and when we talk about analytics here we mean the full range, from traditional descriptive analytics, the BI world, to advanced predictive analytics; both ends of the spectrum are there. We have internal users and customer users who consume the insights generated from the data.

You should be very familiar with this: in order to make any meaningful use of the data, it has to be processed, right from ingestion all the way through profiling, parsing, transforming, enriching, aggregating, and so on, down to the downstream processes that visualize it or apply models on top of it. This is what we mean by data pipeline processing, and we're going to walk you through what we've gone through. It's not simple; people struggle with it, and it's where we spend most of our lives. The data is not necessarily clean, it's not necessarily structured or semi-structured, definitions are missing, there are legacy systems; all sorts of things are happening in here.

Our journey started with a version one back in the day. Most of you should be very familiar with this: the days of multiple ETL jobs running outside the context of the data, with storage and processing separated. Things ran in long-running batches, and whenever we encountered large volumes of data, latency increased. There was no support for unstructured data. Historically speaking, these sorts of solutions took a year to put in place, and they were expensive and inflexible, with large teams working across them. We could not store large amounts of data online because of obvious restrictions. That was life back then.

Then we entered the appliance world: put the storage and processing together in one vertical appliance. It was great, performance improved and latency dropped, but cost increased. We still had to process in batches, or at best low-latency batches; we still couldn't support unstructured data; and the costs were so prohibitive that we couldn't store the data there. We had to do just-in-time processing, store a limited amount of data, and move the rest out somewhere else. It didn't work out, so we moved on to the next version.

Then the whole Hadoop thing came about, and we said, let's look at that. We realized there was a big skills gap: we had a bunch of people familiar with certain technologies, and migrating them would take effort, so we decided to take a pause and see where this was headed before jumping in. We saw MapReduce, Pig, Hive, and a whole bunch of other acronyms, and we didn't want to do a technology migration for the heck of it; what we realized was that the benefits would not be there immediately.

Then, a couple of years ago, mercifully, out came Spark. Spark held a promise: it had everything required for pipelining. It had streaming, it had batch, it had SQL access, it had rich features and in-memory storage, so performance was better. So we said, let's take a look at that and centralize our pipeline processing on this platform. This is what we call our v4 data pipeline: processing is closer to the data, we can process streams or batches, and performance is superior compared to MapReduce and other options. What we did differently this time is that we didn't want to open it up to every single app developer out there; we abstracted it and built a framework. We took all the components needed for data pipeline processing, built them once, and exposed them to the app developers to hook up into a data pipeline. It simplified the design, it significantly reduced the time for us to roll out a solution, and it was highly flexible for us to extend.
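As a rough illustration of that component idea, and not the actual Synchronoss framework, a minimal component-style abstraction over Spark might look like the PySpark sketch below; the class names, the device_id column, and the data lake paths are hypothetical.

```python
# Minimal sketch of a component-style pipeline abstraction over Spark.
# Class names, columns, and paths are hypothetical illustrations.
from pyspark.sql import DataFrame, SparkSession


class PipelineComponent:
    """One reusable pipeline step (parse, transform, enrich, ...)."""
    def run(self, df: DataFrame) -> DataFrame:
        raise NotImplementedError


class DropNullKeys(PipelineComponent):
    """Example cleansing rule: drop records missing a key field."""
    def __init__(self, key_column: str):
        self.key_column = key_column

    def run(self, df: DataFrame) -> DataFrame:
        return df.filter(df[self.key_column].isNotNull())


class Pipeline:
    """Chains components so app developers only wire configuration."""
    def __init__(self, components):
        self.components = components

    def run(self, df: DataFrame) -> DataFrame:
        for component in self.components:
            df = component.run(df)
        return df


if __name__ == "__main__":
    spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()
    raw = spark.read.json("/datalake/raw/events/")           # hypothetical path
    cleaned = Pipeline([DropNullKeys("device_id")]).run(raw)
    cleaned.write.mode("overwrite").parquet("/datalake/curated/events/")
```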
With that, I'm going to go into data profiling. This is interesting. In the good old days, when we had volumes of data and you wanted to do profiling, especially for the modelers, the advice was: take a sample, profile the data on the sample, and use that to build the model. Then the big data world came along and the conventional wisdom flipped: no, use the full population, don't use a sample, train your model on the full population. Then there are others who say, put everything in the lake and somehow it will all work out. How does that work? You still have to go through what the data is; you still need to understand its constructs.

So why do we still need data profiling? We need to understand what is in those data sets. We need to understand the metrics. We need to understand the risks associated with creating rules: when you want to create an analytic data set, you often have to stitch data together to create the analytic record that the modeler will use, and when you stitch it, how do you generate the right rules, and how do you make sure the quality of the data is good? Can we identify the metadata from the data set itself, so we can create the configurations used in the pipeline process instead of manually hooking everything together? How do we understand the challenges and inconsistencies in the data ahead of time? Anything you find later in the cycle is always more expensive and harder to fix. There is also another category of solutions around ad hoc full-text search; to support those you need to tag and categorize the data, and you can't do that without profiling, so profiling became a key aspect there as well.

When we looked at this challenge, this is where most of the time was spent. If you broke down the project lifecycle, munging, wrangling, whatever term you want to use, that's where most of the time went, and if you really broke it down there were a lot of touch points. From the ingestion location, data was moved to some other location, and there were many policies and security concerns, so moving data here and there was not possible. On top of that, the interest in any project depends on how soon it can be delivered; if it takes two months or multiple months, the opportunity is typically lost. So we wanted to address this particular challenge.

Here is the typical scenario. I'm not saying everybody has this, but I have seen it many times. A business analyst will pull a bunch of data into Excel or somewhere else and look at it; with big data you cannot do that. Okay, so we'll put it in a database and run our profiler, but you cannot put it in a database unless you know what the data is, what the schema is, what the structure is. We get data from customers and they tell us what it is, but it's often nowhere close, so there goes cycle time figuring out what exactly they sent. On top of that, you couldn't move data back and forth, and that was the fundamental problem: you could not move it from a data lake location into a database or back to some other store. All of those dependencies were causing a huge headache.

So what did we need? We needed speed, agility, and automation. How do we automate this, and how do we put the power back in the hands of the business analyst or data analyst? We set out with these minimum data profiler requirements: all data is going to reside in the data lake, so you should be able to profile the data in the data lake; you should be able to review and validate the data; you should be able to review the statistics of the data; you should be able to use those same results to create the metadata that runs your data pipeline processing; and you should be able to create downstream schemas, so that if you're going to load the data into an index or a downstream database, the schemas can be created automatically. These were the goals we set out to achieve.
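To make the in-lake profiling idea concrete, here is a minimal PySpark sketch that computes per-field univariate statistics directly against files in the data lake; the file mask, paths, and output location are assumptions for illustration, not the actual profiler.

```python
# Sketch of profiling data in place in the data lake with Spark:
# per-field null counts, distinct counts, and basic numeric statistics.
# The file mask and output location are illustrative assumptions.
import json
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import NumericType

spark = SparkSession.builder.appName("profiler-sketch").getOrCreate()
df = (spark.read.option("header", "true").option("inferSchema", "true")
      .csv("/datalake/landing/subscriber_events/*.csv"))   # hypothetical mask

total = df.count()
profile = []
for field in df.schema.fields:
    col = F.col(field.name)
    stats = {
        "field": field.name,
        "type": field.dataType.simpleString(),
        "row_count": total,
        "null_count": df.filter(col.isNull()).count(),
        "distinct_count": df.select(field.name).distinct().count(),
    }
    if isinstance(field.dataType, NumericType):
        # Univariate statistics used to judge the "health" of a field.
        stats.update(df.select(
            F.mean(col).alias("mean"),
            F.stddev(col).alias("stddev"),
            F.min(col).alias("min"),
            F.max(col).alias("max"),
            F.kurtosis(col).alias("kurtosis"),
        ).first().asDict())
    profile.append(stats)

# Publish the profile so a web application can render it field by field.
with open("profile_subscriber_events.json", "w") as out:
    json.dump(profile, out, indent=2, default=str)
```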
So Spark came to the rescue. We have large data sets and multiple data objects; we can read them in, split them up by field, and run all sorts of metrics, and we can use Spark's built-in transformations. It works very nicely and the performance is great.

How does this work? The simple flow is this: we have a very usable web application, and the user points it to a data lake location, says pick up a set of files based on masks, the full set of data objects, and then launches the Spark application. That application runs in the background, profiles the entire data set, and publishes the results to a repository that is viewable in the web application. Pretty simple.

What it generates is a set of univariate statistics. Whether a field is numeric or non-numeric, there is a whole set of things the data scientists need: how many nulls are there, can we create imputation rules, what is the health of the various attributes, the histogram, kurtosis, mean, median, all of that comes out and can be used. This can be for an individual data set or for any data in the data lake, whether it's merged data, stitched data, or enriched data. At the end of the day, these things are important before you can start the modeling process.

Here is a sample screenshot of what it looks like. It's a simple Angular web app; you go in and pull up the results of a particular data set profile. It gives you a color-coded health indicator for each data field, green, orange, or red, or however you want to set the thresholds, and it presents all the statistics about the field, the number of nulls, histograms, box charts, and so on, in language that is very usable for the data analyst, business analyst, or data scientist.

It also generates, as I mentioned, full-fledged JSON metadata. When you run the profiler, it looks at all the fields; it not only infers the data types and generates the content statistics, it also generates the JSON metadata that is then used by the data pipeline workflow. So if you want to operate on that data set, to transform or enrich it, you can use this metadata to drive that. It also generates downstream schemas, the DDLs, automatically from the profiler output, so the user doesn't have to go in and create all of that by hand. Some data sets are very large, and there are 40 or 50 of them, so you can see how much time can be saved just by profiling them and generating the DDLs.

The advantages: all the source data is already in the data lake, it's been dumped into the data lake location, so all the profiling can be done in the data lake and there's no need to move data back and forth. You can profile the entire data set; you don't have to work with a sample. You can integrate the results into a metadata configuration or a downstream DDL. All of this saves a tremendous amount of time; it might sound trivial, but for those of us who do this for a living, it's a lot of time. The objective is to send cleaner data down to the modelers, because at the end of the day, if you want to generate rules and enrichments, the data pipeline process can be built accurately to cater to the needs of the data scientists downstream.
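As a rough sketch of those last two steps, the inferred schema of a profiled data set could be turned into JSON metadata for the pipeline and into a downstream DDL roughly like this; the type mapping, table name, and paths are assumptions, not the actual generator.

```python
# Sketch: derive pipeline metadata (JSON) and a downstream DDL from the
# schema of a profiled data set. The type mapping, table name, and path
# are illustrative assumptions.
import json
from pyspark.sql import SparkSession

SQL_TYPES = {  # assumed mapping from Spark type names to warehouse types
    "string": "VARCHAR(255)",
    "int": "INT",
    "bigint": "BIGINT",
    "double": "DOUBLE PRECISION",
    "timestamp": "TIMESTAMP",
}

spark = SparkSession.builder.appName("metadata-sketch").getOrCreate()
df = (spark.read.option("header", "true").option("inferSchema", "true")
      .csv("/datalake/landing/subscriber_events/*.csv"))   # hypothetical mask

# JSON metadata that downstream pipeline components could consume.
metadata = [{"name": f.name,
             "type": f.dataType.simpleString(),
             "nullable": f.nullable} for f in df.schema.fields]
with open("subscriber_events_metadata.json", "w") as out:
    json.dump(metadata, out, indent=2)

# Downstream DDL so nobody has to hand-write the target schema.
columns = ",\n  ".join(
    "{} {}".format(f.name, SQL_TYPES.get(f.dataType.simpleString(), "VARCHAR(255)"))
    for f in df.schema.fields)
print("CREATE TABLE subscriber_events (\n  {}\n);".format(columns))
```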
We've seen significant improvement as a result of this approach. What used to take weeks, sometimes days, is now cut down to hours, and the overall data pipeline process has been reduced by roughly 80 percent, I would say. That is why we can say that from the time we receive the data, we can put out full-fledged metrics, in the form of dashboards and descriptive insights, in under a month. We have caught data quality issues that would otherwise trip us up ahead of time, and we have empowered the business analysts as well.

I want to quickly go through this. Profiling is just the first component of our pipeline process. When we built the stack from the ground up with Spark, we said we need a multi-layer architecture, with each layer logically performing a particular function, right from ingestion to data storage to data processing to modeling to integration to consumption. It's a fairly layered infrastructure, and this is the architecture we have in place today. We just talked about the data management layer. The framework components, all Spark components, at least in the profiling, parsing, transformation, and integration layers, each have a set of functions, and these components can be hooked up in a simple Oozie workflow that is completely configurable through metadata. The building blocks are available to the app developers, so they don't have to sit down and write all those transformations. In fact, in the profiling and parsing components we have our own scripting engine integrated, so it's very easy to transform data and write cleansing rules, lookups, substitutions, and imputations with this framework approach.

If you look at the architecture, we have the data lake, we have the orchestration layer, which is Oozie, and all the green boxes are components in the pipeline, whether it's the SQL engine, the data prep engine, the database loader built on Sqoop, or the partitioner; the whole thing is built in a component fashion. If you look at the tech stack itself, we use Elasticsearch for index storage, for quick retrieval and ad hoc analysis; the data lake sits on a MapR distribution; we use Spark extensively in the data processing arena; and we have our own AngularJS visualization layer.

What's next? We continue to expand our component set and move further up the value chain: bivariate analysis, multicollinearity, all of those things that are typically done on the data. We want those as components too, so we can string them together in the data pipeline after the univariate step, along with variable creation and the creation of the analytic data set.
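The talk doesn't show the workflow definition itself, but to illustrate the metadata-driven idea, a small driver could read a JSON step list and dispatch to components like the ones above; the component names, configuration keys, and paths in this sketch are hypothetical, and in the real stack this orchestration is handled by Oozie.

```python
# Sketch of a metadata-driven pipeline run: a small JSON configuration
# names the steps, and a driver dispatches each step to a component.
# Component names, configuration keys, and paths are hypothetical.
import json
from pyspark.sql import SparkSession

WORKFLOW_JSON = """
{
  "input":  "/datalake/curated/events/",
  "output": "/datalake/analytic/events_enriched/",
  "steps": [
    {"component": "filter", "condition": "event_type IS NOT NULL"},
    {"component": "impute", "column": "duration_ms", "value": 0},
    {"component": "lookup", "column": "country_code", "path": "/datalake/ref/countries/"}
  ]
}
"""


def run_step(spark, df, step):
    """Dispatch one configured step to the matching component."""
    if step["component"] == "filter":
        return df.filter(step["condition"])          # SQL-style condition
    if step["component"] == "impute":
        return df.fillna({step["column"]: step["value"]})
    if step["component"] == "lookup":
        ref = spark.read.parquet(step["path"])       # reference/lookup data
        return df.join(ref, on=step["column"], how="left")
    raise ValueError("unknown component: " + step["component"])


if __name__ == "__main__":
    spark = SparkSession.builder.appName("workflow-sketch").getOrCreate()
    workflow = json.loads(WORKFLOW_JSON)
    df = spark.read.parquet(workflow["input"])
    for step in workflow["steps"]:
        df = run_step(spark, df, step)
    df.write.mode("overwrite").parquet(workflow["output"])
```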
I want to zip through the rest. The lessons learned: let the business drive technology adoption; there are a lot of hidden costs; plan incremental updates and deliver something to the business periodically; and simplify the whole thing. Framework-based development is very helpful for speeding up delivery and reducing overall cost. At the end of the day, what our customers need is what's on the right of this slide: when a customer calls into a contact center, they want to know the lifetime value, the churn risk, the profitability. That's the kind of information they want. All the stuff on the left, that's the big data stuff. So we are all about delivering what's on the right to the customer, so they can use the data insights to better their business. With that, I will end my talk. Thanks for the opportunity.
Info
Channel: Spark Summit
Views: 10,610
Rating: 4.7473683 out of 5
Id: r7SF5WldITk
Length: 20min 57sec (1257 seconds)
Published: Mon Feb 22 2016