Hortonworks DataFlow powered by Apache NiFi

Video Statistics and Information

Captions
Good morning, good afternoon, and thank you for joining the webcast. This is Tim Hall, Vice President of Product Management for Hortonworks. Joining me today is Joe Witt, Director of Engineering for us here at Hortonworks, and we're going to talk about the Hortonworks DataFlow product, a new offering that we are launching. We appreciate everybody's attention during the webcast, and if you have questions, we have live audience messaging that you can lob your questions into; we'll have time at the end for some Q&A, assuming Joe and I don't talk too much.

So with that, we want to talk a bit about Hortonworks DataFlow. This is our new product, which is fundamentally powered by Apache NiFi, and we're going to talk a bit more about what NiFi is, where it came from, its heritage, and what problems it attempts to solve. But to set things up, since we're going to talk a little bit about roadmaps, I need to show you our disclaimer slide. Everything we do is done within the open-source community, and as we work on features, additions, and capabilities to the product, those things are gated by the community process that's run through the ASF. Sometimes it may take a little longer than planned to get everybody's agreement, but the general idea is that we're going to run through the standard meritocracy that's run by the ASF.

The agenda for today's webcast: we're going to talk about new data sources and the rise of the Internet of Anything. What we mean by the Internet of Anything is any data, anytime, anywhere; collecting that for historical insights, but also looking to take advantage of that data as it flows from its source all the way back to the data center. We're going to talk about the DataFlow product itself and its capabilities, key concepts, its architecture, and supporting use cases. Then we're going to talk about how Hortonworks DataFlow is complementary to HDP, our Hadoop offering, particularly looking at some qualities of enterprise readiness and streaming analytics. Then we'll go through some planned enhancements that are already underway within the Apache NiFi community, and leave some time for questions and answers.

So with that, I want to focus a bit on what's going on in the world of data. At the bottom you see the traditional data sources being generated from a variety of application systems like ERP, CRM, supply chain management, human capital management, and others. Those are all being stored in traditional databases. But where we see a vast array of information coming from these days is what we're calling the Internet of Anything: sensors and machine data, geolocation, server logs, clickstream, social media, files and emails, and many, many more. You can imagine every application on your phone, notions of the connected car; any metal that moves today has sensor data. The opportunity is to take that information from its source, bring it in, and analyze it, and we're seeing explosive growth in this area.

The challenge, at least from the traditional way that Hortonworks viewed this as a Hadoop vendor, is: how do we ingest that data? How do we flow that data from its source? How do we securely manage it and monitor it? How do we address the change scenarios that are occurring with it? How do we enrich it? That's been a big problem, and so we set out to look for a way to address ease of ingestion, doing that securely, and ensuring that we knew about the chain of custody, the data provenance: where did that data come from, so that we could trace it back.
Now, when our friends at Gartner talk about what's going on in terms of interconnectedness and the demands of user centricity, there's been discussion about the Internet of data, and this is traditionally where the Hortonworks Data Platform plays and has played, powered by Hadoop at its core; the analysis and cost-effective storage of that data is what we've been focused on. Where we're shifting our focus now, and adding a new product line, is the Internet of Things, focused on data flow: securely collecting, conducting, and curating the data in motion, while also driving value for the data at rest through the analytics and use cases that we've traditionally supported through HDP.

The Internet of Anything is really driving a bunch of new requirements. Customers want trusted insights from that data at the very edge, sometimes what we call the jagged edge, where these devices and sensors live, all the way back to their data lakes within the data center, and they want full fidelity of that information over time. Of course, sometimes that's gated by things like limited bandwidth and occasional connectivity to the internet, so addressing those requirements is one of the things we set out to solve. We're also looking at modern applications which need access to the data at rest from a historical perspective, but which may also want to take action on, or address, what we call perishable insights from that data while it's in motion.

In the world of IoT, data flows are no longer unidirectional, meaning it's not sufficient to just take that data, land it into storage, and persist it. We may want to interact with it dynamically; we may want to change the way in which certain information is prioritized over time. So there becomes a more bi-directional means in which you're interacting with the data and how it's flowing from those sources all the way back to your data center. And last but not least, the perimeter these days is aggressively moving outside the data center to the jagged edge, and that jagged edge is things like the connected car and mobile applications, things that traditionally have not sat within the data center. So the idea is: how do we also address those kinds of environments in the context of all of these different use cases?

Of course there are limitations today, and this is one of the things we were looking at as a Hadoop vendor. Traditional data movement has really been built for one-way flow, and we knew we needed to address a more dynamic way in which to deal with the data as it's flowing. The tools that have been built around this have been difficult to manage and sometimes architecturally disjointed, meaning we'll build individual tools to solve individual data flow problems; bringing them together in a holistic environment, with a palette of options that lets you put these things together, is what we wanted to address. For the business, one of the biggest challenges for long-term historical insight and analysis is simply landing the data itself, so anything we can do to ease that ingestion, regardless of where the information is coming from, is a huge win. And of course we mentioned the jagged edge: data flowing from sensors that may be on ships or airplanes or oil rigs, or retail chain stores that may not have high-fidelity bandwidth. We'd like to be able to optimize the flow of that data and use the least costly means to get that information back to a central source.
And of course the biggest deal is that there are insights you would like to take action on immediately; it's not good enough to simply have a historical view of the world. How do we address things like a truck in motion whose cooling unit goes out, spoiling the food on the way to a grocery chain? We'd like to be able to detect that before the truck arrives and ensure that food doesn't go on the shelves. Looking to address those things as rapidly as possible is part of the challenge we set out to solve.

In order to do that, Hortonworks went out and acquired a company with a story very similar to the way Hortonworks was formed, where Yahoo was the original company behind Hadoop and decided to open source the Hadoop platform into Apache, and then Hortonworks was founded as a commercial entity to provide support for Hadoop and the Hadoop ecosystem of components. We found that the team from Onyara had a similar journey and a similar makeup to us. Apache NiFi, the technology many of their employees were working on, was originally created at the NSA and then delivered to Apache through the technology transfer program in the fall of 2014, and Onyara was founded, just like Hortonworks was, to provide commercial support around Apache NiFi itself. In order to address some of those architectural limitations and use case challenges we mentioned before, we found NiFi to be an extremely powerful technology that was very complementary to what we were already doing here at Hortonworks, and we found the DNA of the individuals and the management team at Onyara extremely complementary to our history and heritage. So we thought this marriage made a ton of sense, and we welcomed the Onyara team here a few weeks ago. Now we're proud to launch the Hortonworks DataFlow offering, again powered by Apache NiFi, to help us with ingest and some of these other challenges.

The way this will work, for those customers that may already be familiar with HDP, which is powered by Apache Hadoop: we'll have a new product offering here, DataFlow, powered by Apache NiFi. The idea is to help us deal with all of these different sources of new and emerging datasets, help us extract those perishable insights, and then integrate with HDP in terms of its ability to store the data and associated metadata, while also providing the ability to enrich the DataFlow product itself with the historical information we have access to. These will be independent offerings from a support subscription perspective, but of course we will deeply integrate them and ensure that they work together very effectively, and we'll talk a little more later about where we're going to take this going forward.

So in terms of the Internet of Anything and data flow, you can think of HDF, Hortonworks DataFlow, as focusing on the acquisition of data and flowing that data to a persistent store, and of course HDP is focused on the processing and analysis of that data on top of that storage. So we get a nice story here: acquire the data, flow it to the right location, and analyze it together. Of course, when you look at the realities beyond that simplistic view, organizations today are global; we're spanning different geos, and there may be multi-data-center deployments for everything from disaster recovery to dealing with restrictions in regulatory environments and the handling of personal data.
Again, we've got global organizations with far-flung entities, retail chains out in remote locations, or in oil and gas you've got oil rigs and trucks moving between locations. You've got business partners that you want to share data with for interesting and compelling use cases, and all of these things have a different velocity and variety of information that needs to be exchanged. The bandwidth and latency of that information, and how it can move around, is one of the critical challenges. So what we set out to find was technology that could deal with small footprints, potentially operating with very little power, able to optimize the flow of the data as well as the control plane and metadata over potentially limited-bandwidth environments, ensuring that the data would be available for those perishable insights, and doing it in a secure and repeatable fashion. That's really what we set out to find when we were trying to figure out the right technology and team to work on this with us, and that's how we ran into the folks from Onyara and, of course, Apache NiFi.

But it's not only at the edge; we've got plenty of data flow requirements within the data center itself, and again this has been one of the challenges for many of the customers we've been working with: how do we simplify the data ingestion problem? How do we understand the data and where it has come from across the various systems that may reside within the data center itself? How do we provide agility to deal with changes in those new data types as they're coming in and changing over time? You can set up the original job, but as things move forward down the track, things change, and the ability to quickly deal with those changes and continue to flow that data for your analysis needs is absolutely critical. We need the ability to handle dynamic access controls and security, because those change frequently as well. You have different cross-cutting concerns in terms of being able to enrich, filter, and transform data. And there's the transition many organizations are making from legacy systems to modern 24/7, always-on environments, with a growing variety of events and a growing variety of formats, schemas, and protocols in use; none of that is slowing down. So we believe Apache NiFi is a great and complementary technology to bring into Hortonworks, and we're excited to bring it to you all today. We're going to start by drilling down with Joe on NiFi itself, the core of the Hortonworks DataFlow product, and we'll start with the three key concepts. Joe?

Thanks, Tim, I appreciate it, and thank you all for being here today to listen to us talk about Hortonworks DataFlow. I just want to start with a simple introduction to Apache NiFi itself, which really frames a lot of the philosophy we're going after with Hortonworks DataFlow and everything I'll describe for NiFi. There are three main central themes to keep in mind as we go through this. One of them is fundamentally being able to provide a really solid management experience for how systems connect to each other and what happens to data as it moves through systems: just fundamentally provide really solid data flow management.
The next piece, related to that but its own concept, is this idea of data provenance: we want to record really fine-grained detail about everything that happens to data. Where does it come from? What do we learn about it? What do we do to it? Where do we send it? When is the life of an object completed? It's about managing that lifecycle throughout the data flow, and that drives some really important user experiences that make the management problem easier, but it also feeds into bigger-picture enterprise concerns like enterprise-wide governance. Finally, a really important piece of this domain is providing rock-solid security on both the control plane and the data plane. We say this because we've been talking about bi-directional data flow, and part of that bi-directional story is being able to handle the data efficiently and securely, but also the commands to change behavior, and that requires a robust security approach for both spaces. We'll talk about these a little more as we go along.

Okay, so I'm going to describe at a high level some of the key features, just to put us in the right mindset about what Hortonworks DataFlow, powered by Apache NiFi, provides out of the box, and then we'll talk through a little bit of the architecture and look through the actual application via slides.

The first thing to talk about is guaranteed delivery. As you increase the volume of data, and as you deal with the fact that there are power failures, system failures, network issues, and so on, it's really important that you have a robust guaranteed delivery story. That means providing transactional communication to wherever data is coming from, within the data flow system itself, and to wherever it's delivered. It's important to keep in mind that we're not talking about a transaction between the producer and the consumer; we're talking about a transaction between NiFi and the system it's getting data from, as well as between NiFi and the system it's delivering data to. Most data flows really aren't linear; they form big graphs where data comes in and is then delivered to multiple systems in parallel, so you have to have a robust, transactional, guaranteed delivery story, and NiFi certainly does that. I'll describe it a little more as we look at the architecture in a minute.

The next thing is data buffering. Fundamentally, the thing to keep in mind whenever you're in the data flow business, whenever you're in the business of connecting systems, is that no matter what, there's always some system that is down or degraded. That's just a reality, so you have to inherently support buffering to tolerate those systems that may not be online at any given moment in time, or may not be able to keep up with the rate of data flow. What this also means is that you need some important first-class features. One of those is back pressure: it's not enough to simply buffer indefinitely, because then the data flow system itself becomes at risk, so you need some way of effectively throttling all the way back to the point where you may in fact slow down the consumption of data. Now, you may say, in the case of my organization I can't do that; the producer is always sending data and I have to always be listening, and that's fine as well. You have to make a choice between back pressure or some sort of pressure-release mechanism, and NiFi supports both models, because these are just realities that exist in data flow.
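To make that choice concrete, here is a minimal, NiFi-independent sketch of the two models using nothing more than a bounded queue in plain Java; the capacity and timeout values are arbitrary and purely illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Conceptual illustration only, not NiFi code: a bounded buffer between a
// producer and a slower consumer forces the same choice described above.
public class BufferingModels {

    // Bounded queue standing in for a connection with a back-pressure threshold.
    private static final BlockingQueue<String> QUEUE = new ArrayBlockingQueue<>(10_000);

    // Back-pressure model: when the buffer is full the producer blocks,
    // which effectively slows consumption from the original source.
    static void acceptWithBackPressure(String record) throws InterruptedException {
        QUEUE.put(record); // blocks until space is available
    }

    // Pressure-release model: the producer never blocks for long; if the buffer
    // stays full past a deadline, the record is dropped (a deliberate loss-tolerance choice).
    static boolean acceptWithPressureRelease(String record) throws InterruptedException {
        return QUEUE.offer(record, 100, TimeUnit.MILLISECONDS); // false means aged off / dropped
    }
}
```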
The next thing is that if you have this data buffering, very often what you want to be able to do is prioritize data. What's commonly talked about today is natural ordering, or insertion order, and that makes sense for a subset of use cases, but in the broad enterprise sense, as you're dealing with constrained bandwidth or constrained resources, you have to make prioritization decisions. The queuing mechanism NiFi supports allows you to prioritize data dynamically. As Tim mentioned, thinking about the jagged edge in particular, you can imagine that when your access to data exceeds the bandwidth you have to send it back, you fundamentally have a prioritization decision to make, and that's really what drove this design long ago.
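Prioritizers are chosen per connection in the flow, and under the covers they are essentially comparators over FlowFiles. A minimal sketch of what a custom one can look like, assuming a hypothetical "priority" attribute set earlier in the flow; the attribute and class names are illustrative, not NiFi built-ins.

```java
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.flowfile.FlowFilePrioritizer;

// Sketch of a custom prioritizer: FlowFiles carrying a numeric "priority"
// attribute are pulled from the queue before those without one.
public class AttributePriorityExample implements FlowFilePrioritizer {

    @Override
    public int compare(final FlowFile a, final FlowFile b) {
        return Integer.compare(priorityOf(a), priorityOf(b));
    }

    // Lower numbers win; FlowFiles without a parseable priority sort last.
    private int priorityOf(final FlowFile flowFile) {
        final String value = flowFile.getAttribute("priority"); // assumed attribute name
        try {
            return value == null ? Integer.MAX_VALUE : Integer.parseInt(value);
        } catch (final NumberFormatException e) {
            return Integer.MAX_VALUE;
        }
    }
}
```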
The next thing to talk about is flow-specific quality of service. What we mean here is that it's not enough to tune the whole system to be low-latency oriented, or tune the whole system to be high-throughput oriented; these are somewhat at-odds ideas, and there are trade-offs being made. We want to enable people to make those trade-offs at very specific points in the flow, specific to what that particular data flow or that particular consumer needs, so we allow you to control that at a really fine-grained level. At the same time we also allow you to control loss tolerance. You may have a developmental flow, or a flow where data is only useful if it's delivered within some time; Tim talked about perishable insights, for example. If you have data which needs to be consumed, say, within seconds to be of value, then once it's minutes old you can terminate it. We enable you to choose the points in the flow where, if you do want to let data age off, it will do that for you; you can think of that as a form of the pressure release we described a moment ago.

We already talked a little bit about data provenance on the previous slide, and I'll show you what that looks like in the application to make it more concrete. It's a bit of an unfamiliar term for a lot of folks; it's not a new concept, but it is a very important idea to discuss, so I'll describe it further as we go ahead. The other thing to keep in mind is that we're building a rolling log of really detailed history of what happened, both to the content of the data and to the context, what we've learned about it, and we want to keep that as long as we possibly can. It's not just that once data enters the flow we keep track of it until it's out and then get rid of it; we want to keep that as long as we can. Real time is important, being able to interact with the flow is important, but it's equally important to be able to go back in time, step through, and understand what happened, and the architecture and the provenance data we have allow you to do that in a really powerful way.

The other thing is visual command and control; think of this as part of that management story I was describing. We want to enable people to have real-time, interactive command and control of the data flow. This is really important because largely what people see today are design-and-deploy type systems, where you may be able to build the flow you need visually, but you're building it all at once and then deploying it out. The problem with a model like that is the feedback cycle is too slow and the cause and effect gets blurred. You can think of NiFi as providing immediate feedback to every change you make, and that's important because it helps you understand whether what you did had a good outcome or a bad outcome; if it's not what you were looking for, you can immediately correct it and keep iterating. Now, the key thing to keep in mind is that when we say visual command and control, people immediately assume the human use case, the human actor making changes to the system. But if we think back to this bi-directional data flow story, interactive command and control doesn't just enable people to make changes; it enables systems to make changes. Autonomous feedback is the idea to think about: we want analytic results to immediately feed back into changing the behavior of the data flow. That could be changing prioritization decisions, electing for recovery cases to kick off, adding new data flows or removing existing ones; whatever the case may be, everything a person can do to the data flow, we want systems to be able to do as well. That has a lot of implications for how to design the system.

The other thing is flow templates, which is to say you can build data flows, establish and test them, decide they're your gold standard and good to go, and then save that as a pre-configured component, an already connected set of processors and relationships that allow data to flow a certain way. Now you can share that with other teams, configuration-manage it, whatever the case may be; it enables sharing and reuse at a higher level. The next thing is that we have a pluggable authorization model today that allows you to use something as simple as a local file as the authority mechanism, but also to call out to an external service; think of something like Active Directory or some third-party authorization service you may already have, since every organization has some form of one. We know from the outset that this has to be pluggable, and similarly we already support multi-role security, things like users who can only read or see the flow but not manipulate it. We'll go into this a bit more as we dive into the security story in a few slides.

Another key point to bring up is that NiFi is designed for extension. Anyone who's dealt with the data flow problem, whether you've written a script, run traditional integration tools, or had to connect systems using whatever mechanism, knows that it's never done. This is an inherently last-mile challenge: there's always some new protocol, some new format, some new schema, something that requires you to go beyond all the tools already sitting in the toolbox. We understand that from the outset. NiFi is designed to be easily extended in a way that allows you to build components that still show up looking consistent and cohesive in the UI, and that get a lot of built-in fault tolerance, error handling, and consistent behavior.
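To make that extension point concrete, here is a minimal sketch of a custom processor written against the open source NiFi processor API as I understand it; the class name, attribute name, and logic are hypothetical and purely illustrative.

```java
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.annotation.documentation.CapabilityDescription;
import org.apache.nifi.annotation.documentation.Tags;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// Hypothetical processor: tags each incoming FlowFile with a size category
// attribute and routes it to "success".
@Tags({"example", "tag"})
@CapabilityDescription("Adds a size.category attribute to each FlowFile (illustrative sketch).")
public class TagBySizeExample extends AbstractProcessor {

    public static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles that were tagged successfully")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();       // take the next FlowFile from an incoming queue
        if (flowFile == null) {
            return;                              // nothing queued right now
        }
        final String category = flowFile.getSize() > 1_000_000 ? "large" : "small";
        flowFile = session.putAttribute(flowFile, "size.category", category);
        session.transfer(flowFile, REL_SUCCESS); // the framework handles the transactional commit
    }
}
```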
The final point to make, and Tim talked about it earlier, is that we want to be able to support those really small environments, but sometimes you are running in a large data center, so we have a scale-out story through a clustering mechanism. We're not talking about hundreds or thousands of nodes here; we're talking about a set of nodes all working cooperatively to feed data into the processing systems, Storm, Spark, or others, where those systems may have a much larger footprint.

Okay, so this is a listing of some of the use cases NiFi has been exposed to over the years. One that's really interesting to talk about is the predictive analytics scenario. At the heart of a predictive analytics case, what you're trying to do as an organization is acquire as much information as possible to get the most accurate picture you can, to determine whether a failure or some negative situation will occur: not that it already has occurred, but that it's likely to occur. This is really important in industries or situations where you want to keep systems or processes in operation, and if you're able to perform up-front maintenance or take steps that help avoid a longer-term outage, you can save a lot of money; it helps greatly with the bottom line. At the heart of that is a very important data flow challenge, both out at the jagged edge and within the data center. From a jagged-edge perspective, let's say you're capturing information off of things like cars or trains or planes; in every single one of those cases you're capturing information and delivering it over highly constrained comms. You're also running in environments where you don't have access to a multitude of servers, and even if you do have compute resources, they're going to be very limited: small CPUs, limited memory, limited storage. The challenge there is acquiring that data as reliably as possible, prioritizing it, and delivering it back to the data center.

Once data arrives at the data center, you now have data coming in from a variety of sources, ideally multiple sources, potentially even about the same concept, so you can get a bigger picture, provide more context, and have a richer chance of predictive analytic success. In that case you have the agility challenge we talked about previously: how do I manage the flow of this information right now, how do I understand what's happening, and how do I then effect change? That plays to a core strength of NiFi, as we've described, with this interactive command and control. But let's take it a step further: in the predictive analytics case, where you're trying to detect impending failure, what if you miss it? What happens if you did not detect the failure? Chances are the reason you didn't detect it is that the data you were looking for didn't arrive, or you were looking for the wrong information in the first place. One of the really important features NiFi provides is this provenance concept we've talked about. It allows somebody to go back retroactively and select data that may previously have been considered lower priority and never made it home, and have that data recovered and redelivered. That drives some really important diagnostic analysis use cases, so you can improve your predictive analytics and change the data flow. So there's a really strong story there, based on the interactive command and control as well as that provenance, recovery, and replay capability.
And of course we haven't even touched on the security aspects yet: how do you trust and understand the origin and attribution of that data? Having a fine-grained chain of custody, a data provenance solution, is how you do that, and these are foundational features built into Apache NiFi.

Okay, so we'll take a brief moment to talk architecture; there are really a couple of key points we want to get across. On the left, think of this as a simple node. It could be something as small as a one-core device, maybe one or two CPUs, very limited RAM, maybe 512 megabytes to a couple of gigabytes, with something like a really small flash drive, SSDs, or spinning disks; some sort of really small device, or just a limited-footprint enterprise server. In this case we're running in a JVM. Within that JVM we have a web server, and that web server is how all the commands come in; NiFi is driven through a RESTful API, which is what dictates what happens to the flow and is also how you observe status. Behind that is an engine, a processing engine, which runs all of these extensions to acquire data, assess some relative value, make prioritization decisions, and ultimately deliver to follow-on systems. Behind the scenes, so that we can provide guaranteed delivery, do high-throughput and low-latency processing, and keep track of this chain of custody for data, we have three repositories. One keeps track of the metadata about the objects going through the flow; another keeps track of the actual raw bits themselves, the messages or records or tuples, whatever is coming off of source systems; and finally we have a provenance repository, where we keep track of all the lineage information. Each of these is separate because they have different lifespans and different design principles behind them; they're solving different parts of the problem. If you think back to that recovery and diagnostic scenario I just described in the predictive analytics case, it's the provenance repository that would be queried to look for data that was not delivered, and then it can be recovered because it links to the content repository, and we keep the content as long as we can; we don't age it off until we have to, based on size or age requirements.

If you take that same simple story, that same exact user experience, and now want to scale it out, we have a clustering mechanism that gives us a cluster manager that replicates requests to all the nodes. The nodes all operate the same way; they all have the same behavior and the same configuration; they're just receiving different slices of data. This is how we can scale out and still provide the exact same user experience. It's important to talk through how we ensure high availability, or durability, of the data flows. Even if a node drops out of the cluster, the cluster continues on. For example, NiFi provides a site-to-site protocol that handles load balancing and failover automatically, shares information about what's going on in the flow, and allows a client to either push or pull, both giving data to NiFi and getting data from NiFi. Your data flow continues running despite a node dropping, and if you need to add a new node, that's fine; you don't have to pre-coordinate that. We want to enable operations teams to be as flexible and agile as possible.
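As a rough illustration of the client side of site-to-site, here is a minimal sketch that pushes one payload into a NiFi input port, assuming the nifi-site-to-site-client library; the URL, port name, and payload are assumptions, and the exact method names should be checked against the NiFi version in use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

import org.apache.nifi.remote.Transaction;
import org.apache.nifi.remote.TransferDirection;
import org.apache.nifi.remote.client.SiteToSiteClient;

// Sketch of a client pushing one payload into a NiFi cluster over site-to-site.
public class SiteToSitePushExample {
    public static void main(String[] args) throws Exception {
        try (SiteToSiteClient client = new SiteToSiteClient.Builder()
                .url("http://nifi-host:8080/nifi")   // any node; the protocol learns the cluster topology
                .portName("From Edge Devices")       // an input port assumed to exist on the receiving flow
                .build()) {

            final Transaction transaction = client.createTransaction(TransferDirection.SEND);
            final Map<String, String> attributes = Collections.singletonMap("source", "edge-sensor-42");
            transaction.send("sensor reading payload".getBytes(StandardCharsets.UTF_8), attributes);
            transaction.confirm();   // confirm before completing, in a two-phase style
            transaction.complete();  // commit the transfer
        }
    }
}
```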
If that cluster manager, the NiFi Cluster Manager (NCM), goes down for any reason, the nodes in the flow continue operating. You can't bring up the UI at that point, but as soon as you bring the NCM back online you can see the flow again. So we've taken a lot of steps to make sure that, despite a range of potential failure cases, the business-critical data flow requirements keep running.

Okay, so I'm going to walk through a few screenshots of what it's like to operate Hortonworks DataFlow, powered by Apache NiFi, just to give you a visual sense of what's happening. This is something which, frankly, is a lot more fun to just download on your desktop and try for yourself. What you're looking at here is a screenshot of NiFi running in a web browser; any HTML5-capable browser should work well. The screen we're looking at you can think of as a blank canvas, and that's essentially what happens when you start NiFi for the first time: you have this engine sitting there ready to do something, and now you can start to design the flow you want.

That's what we have shown on this next slide. You select from the top left, click and drag down onto the graph where you want to add a component, and start the process of selecting which component to add. For example, today there are about 90 processors in Apache NiFi that come out of the box, and you can search based on their names or tags. These do things like interact with databases, to get and push data, or interact with HTTP or JMS or other open-source projects like Kafka and Solr. Once you've acquired data you can enrich it, make routing decisions, and transform the content, things like converting CSV to Avro or Avro to JSON; there's a whole range of scenarios you can walk through. There's also support for interacting with legacy systems, like when you just have to pull files off of a file system or store files somewhere, and if you want to set up an alerting flow you can send emails. There's a whole range of these Legos, if you will, that you can throw onto the flow and build with.

So we pick one of these processors for our flow; in this case we want the processor that interacts with Twitter. We type in Twitter, we see we have this GetTwitter processor, and we click Add. Now we can start configuring it: it's on the graph, and we want to give it some properties. When we do that, we're able to specify what type of Twitter endpoint we want to hit, any terms we may want to filter on, and a whole range of things specific to what that processor does. You adjust those parameters, and now you have this processor on the graph, ready to go acquire data and start doing some processing on it. The next thing you want to do is deliver that data somewhere; in this case we're going to put these tweets into HDFS, so we drag and drop the processor icon onto the graph again and type in "put" or "HDFS" or any number of tags that will quickly get us there. Once we've done that, we can grab this PutHDFS processor and configure it: we can tell it where to find the configuration file for interacting with HDFS, the directories we may want to write to, how we want to handle certain conflicts, and a range of other HDFS-specific tuning parameters.
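The walkthrough above is entirely visual, but processor developers get a similar wire-up, run, and inspect loop in code through NiFi's mock test framework (the nifi-mock module). A minimal sketch, reusing the hypothetical TagBySizeExample processor from the earlier extension sketch:

```java
import java.nio.charset.StandardCharsets;

import org.apache.nifi.util.TestRunner;
import org.apache.nifi.util.TestRunners;

// Sketch: exercise the hypothetical TagBySizeExample processor with NiFi's mock framework.
public class TagBySizeExampleTest {
    public static void main(String[] args) {
        final TestRunner runner = TestRunners.newTestRunner(TagBySizeExample.class);
        runner.enqueue("a small payload".getBytes(StandardCharsets.UTF_8)); // stand-in for queued data
        runner.run();                                                       // a single onTrigger invocation
        runner.assertAllFlowFilesTransferred(TagBySizeExample.REL_SUCCESS, 1);
    }
}
```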
Once we've done that, we have two processors on the graph, and now it's a matter of dragging a relationship from GetTwitter to PutHDFS; just click and drag. When we do that, we get to decide the behavior of what happens on that connection. This is where we set things like how we prioritize the data and whether we want back pressure to kick in. Let's say we're having trouble delivering to HDFS fast enough and we want to stop pulling in the tweets: back pressure would do that automatically. If some configured threshold is hit, the previous processor starts slowing down, and back pressure gives a really nice natural propagation all the way back to the source.

So now we have this wired up; our GetTwitter and PutHDFS processors are connected, and we can go up to the top of the graph and just hit play, or start individual components, whatever the case may be. You can keep building while the data is flowing, as much as you like, and immediately the data starts flowing. The UI gives you a rolling five-minute window of information about what's happening: statistics about how many bytes have been read, how many bytes have been written, and how many objects in total are being dealt with. This gives you really important information as an operator wanting to understand what's happening to the system at runtime, what's happening in production, and which parts of the flow are using the most resources. We're actually counting the bytes being read and written by each processor against the underlying repositories, and that information is really valuable so you can see which parts of the flow are consuming system resources. We're also tracking things like how many tasks are run, how many times we give that processor a thread to execute, and how much wall-clock time is consumed to do that work; really important information that, if you're an operator, lets you understand what's happening to your system in production, or, if you're a developer, lets you really fine-tune latency behavior as you pick different configuration settings.

As I've mentioned, you can immediately start to dynamically adjust these properties; you don't have to stop the entire flow or deploy an entirely new one. You can go to very fine-grained points of the flow and tweak configuration settings. You can even add new processors onto the graph and click and drag the same relationship again, and now NiFi sends a copy of the data to the other processor as well, so you don't interfere with the existing, let's say production, flow while you start building a new one. This is really valuable, because there's nothing quite like production data to tell you whether or not you're on the right track, and we don't actually have to copy bits under the covers, so this allows us to do high-scale processing very efficiently; that's something we can talk about more as we go forward.
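As Joe notes later in the operations discussion, the statistics shown in the UI are also exposed over NiFi's RESTful API, so an external script or monitoring system can poll them. A minimal Java sketch; the host and the "/nifi-api/flow/status" endpoint are assumptions and vary by NiFi version.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: poll NiFi's REST API for flow status from an external monitoring script.
public class StatusPollExample {
    public static void main(String[] args) throws Exception {
        final HttpClient http = HttpClient.newHttpClient();
        final HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://nifi-host:8080/nifi-api/flow/status")) // endpoint is an assumption
                .GET()
                .build();
        final HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        // The JSON body carries cluster-wide counts such as queued FlowFiles and active threads.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```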
All right, I'm going to take a moment to talk through the data provenance piece. This is one where you really just want to download NiFi and try it yourself, so you can get a very real sense of what this means, because it's a very big deal. While this data flow is happening, pulling in tweets and sending data to HDFS, NiFi is automatically recording information about the data being captured: where it came from, what time it occurred, and pointers to the exact content at each point in the flow. If you go up to the top right part of the graph, you can click on the data provenance icon, and that brings up a tabular, almost database-table-like dump of provenance events. You can then start searching for events of interest. Once you've done that, you can click on any one of them and not just look at a traditional log-style record of what occurred; you can click on a data lineage icon, which has the application compute and visualize the actual graph, the path the data took through the system. Now you can see exactly where it was received and where it was sent, and if we made any transformations, converting JSON to Avro, extracting elements out of that JSON and creating some other format, encryption, compression, decompression, whatever the case may be, there'd be a provenance event for each of those changes. You can right-click on each of those events and look at exactly what we knew at that time, follow the chronology of events, and see precisely how the data flow unfolded.

So here's an example of looking at the details of one of those dots on the graph, if you will. Keep in mind we're looking at historical information, but it's historical information that could have literally just occurred; you could essentially use this to watch a data flow unfold. Here we're looking at the details of a specific provenance event. You can click on the content tab, and that lets you view the content as it existed precisely at that point in the flow. Think of it like a finite state machine with persistence behind it, so you can roll back time and step through precisely what happened. Really cool, and really helpful if you're doing things like transforming data from one format to another, which is a notoriously unpleasant experience simply because people can't see what's happening. This addresses that problem, because now you can view before and after, which makes it really easy to understand whether you're on the right track. It also means that if you're not, say you did a transformation and it wasn't quite correct, you can fix it and then click the replay button: it will take the exact same source content and context but run it against the new configuration, and now you can look at that result. Just think about what that means for iterating essentially in real time until you get it right, and then you're good to go.

So here we're actually looking at the content; I'm sure that's quite difficult to see, frankly. What's on the left in this case is an actual tweet, so we're looking at the payload of the tweet, a JSON object that came from Twitter, and we can click precisely through to that content. And here you can see what it looks like as a flow starts to look more realistic; like I said before, it's not just linear chains, these things expand out. Okay, so we've talked through some of this. There's also auditing information for all the actions users take to manipulate the flow, recording what they're stopping or starting or making configuration changes to. This is really critical if you're dealing with a large data center configuration and you have multiple people, a large team, or even multiple teams operating on the same system; you want to be able to see what people are doing and when they're making changes.
Okay, so that was a quick intro to what the user experience of NiFi looks like, and a little bit about the architecture. Now Tim is going to help us talk about how HDF and HDP play together.

Yeah, so one of the things we've been focused on for HDP, powered by Hadoop, is the enterprise readiness capabilities: operations, governance, and security. One of the things we loved about Apache NiFi as we bring it into a new product is that it already had some of those capabilities built in. So we're going to talk through those enterprise readiness qualities, and then a little bit about where we want to go from here in terms of extending its operational characteristics, extending provenance into the broader concept of data governance, and discussing where we'll head from a security perspective. I want to start with Joe talking about the existing operational capabilities and characteristics of NiFi itself.

Yeah, so today NiFi supports essentially both a push and a pull model for a lot of the really critical information. Today we can push, for example, provenance data and statistics out through our Reporting Task API; this would be somebody writing code that runs in NiFi to look at this information and then push it out to an external service, Atlas for example. There's also the pull model, which is everything we have exposed through the RESTful API, so an external service could essentially poll NiFi to grab the statistics and be on its way. This is nice for people who are running scripts, or who have some other system that doesn't have an agent available to it but still needs to get data and check on status, that sort of thing.
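A minimal sketch of that push model: a custom reporting task that runs inside NiFi on a schedule and summarizes flow statistics before handing them to an external service. The base class and getter names reflect my reading of the NiFi reporting API and should be treated as assumptions; the actual delivery step is left as a comment.

```java
import org.apache.nifi.controller.status.ProcessGroupStatus;
import org.apache.nifi.reporting.AbstractReportingTask;
import org.apache.nifi.reporting.ReportingContext;

// Sketch of the push model: a reporting task scheduled by NiFi that summarizes
// flow statistics. Getter names are assumptions; verify against the NiFi API in use.
public class StatsPushExample extends AbstractReportingTask {

    @Override
    public void onTrigger(final ReportingContext context) {
        final ProcessGroupStatus root = context.getEventAccess().getControllerStatus();
        final String summary = String.format(
                "queued=%d flowfiles (%d bytes), read=%d bytes, written=%d bytes",
                root.getQueuedCount(), root.getQueuedContentSize(),
                root.getBytesRead(), root.getBytesWritten());
        // A real task would push this summary (or provenance events) to an external
        // service such as Atlas or a monitoring system; printing stands in for that here.
        System.out.println(summary);
    }
}
```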
As we've talked about, NiFi already supports dynamic data flow changes, which is obviously really important, and we've described how the REST API allows us to do that for both people and systems. Site-to-site is a very important protocol and concept: it allows two NiFi clusters, or two NiFi nodes, or a client interacting with NiFi, to use a fault-tolerant and scalable protocol. It also means that a data flow manager connecting from one data center to another doesn't have to switch their mindset from the business thread they're operating on and the enrichment they have just done to suddenly thinking about JMS or FTP or SFTP or some low-level protocol. Their mental model is connecting one processing thread at a data center to another processing thread at a data center; it keeps them at the right level of abstraction. We've already talked about the fact that it's extensible, and we've put a lot of effort and energy into optimizing the user experience, making it really easy for operators to understand precisely what's happening in the flow. We don't want people hunting through log files to figure out what happened; we want to expose that in a natural way. Data flows are graphs, and we can show those graphs and allow you to interact with them.

So the question is, where might we go from here? Obviously HDP is powered by Apache Hadoop, but there are more than 20 different components that exist within HDP today. HDF, the Hortonworks DataFlow product, is going to start powered by Apache NiFi. One of the first questions we're getting asked, including over the chat, is: can I explore the capabilities of Apache NiFi today in the context of the HDP sandbox? The answer is yes, you can. For those who want to explore that, go to the Hortonworks Gallery; you can either google it, or it's hortonworks-gallery on GitHub. If you click on the Ambari extensions link, you'll see the ability to deploy an instance of Apache NiFi with the HDP sandbox, along with step-by-step instructions to set that up. What we're showing you now is what that actually looks like from an Ambari perspective: you go into the Actions tab in the lower left-hand corner, add a service, and in that service wizard there's the ability to check the NiFi box. Then you go through the other configuration and optional configuration parameters in the wizard, deploy, and light up NiFi. This was put together very rapidly by our brilliant partner solutions engineering team as an extensibility point of Ambari, using Ambari stacks, and it's a great example of how to bring these pieces together extremely rapidly. Now, a typical NiFi deployment, perhaps on the jagged edge, is a very simple deployment model where you basically untar the distribution itself, and we're going to stay focused on making sure we continue to support ease of deployment in both the jagged edge and data center environments. But this paints the path for how we'll take HDP and HDF forward in terms of a consistent operational experience: if we decide to add additional componentry to HDF down the road, we'll build a stack definition that will allow Ambari to effectively manage and deploy all of those components, for those of you already familiar with it today. So now we're going to shift our focus, Joe, back to governance and the need for provenance.

Yeah, so we showed you a little bit, mechanically, about how provenance works in NiFi, but let's back up a second and think about why this is important from a bigger-picture perspective. For operators it means they get really fine-grained traceability and lineage to walk through and understand what happened in the data flow, which is essentially what we showed by looking at the UI in NiFi. We also showed and talked a bit about what that means for recovery and replay; those are just frankly very helpful. The thing people tend to gravitate to most often, though, is what it means for compliance. It forms a really rich audit trail, not just of what people are doing, but of what's happening to the data as it flows between systems or within systems. That information becomes really valuable so that you can keep track of whether you delivered the right data, or, in the event that you delivered the wrong data or didn't sanitize it enough, so you can understand what systems you delivered to and precisely what data you delivered to them. It means you can start to actually remediate in the event of a compliance incident, or more completely understand that compliance issue, and on the flip side it also means you can prove you are doing the right thing, which is equally important.
The part that's most exciting, though, is if you project forward and see how this unfolds across an enterprise. You already see really strong provenance features within HDP, which I think Tim will talk about a bit, but what this means is we can take that same story and push it out further to the edge. We can provide that provenance for the data, and now if you have an analytic result, you can start to understand what sources were used to produce it. That means you can start to value your data sources, which is particularly important if you're having to pay for them. It also means you can start to value the IT systems involved in that processing chain: how long did they spend processing data, which systems were involved in the critical business insights you generate, and, perhaps more importantly, which systems were not involved. This chain of custody really starts to expose and show where the value lives across the enterprise.

Related to that, people are asking questions on the chat about data governance and what this means. We started the Data Governance Initiative back in January 2015, and out of that came Apache Atlas. Apache Atlas is focused on the Hadoop ecosystem of products and on making sure we can do set-based governance and metadata capture from the source systems all the way through to Hadoop. What you see here is that we will be adding the ability to do event-based lineage from the Internet of Anything, using HDF and the provenance capabilities coming through Apache NiFi, and figuring out how we connect that with the set-based capabilities Apache Atlas is delivering. Other questions on the chat ask about the difference between provenance and governance. The way I like to describe it is that governance is a broad term focused on transparent, reproducible, auditable, and consistent access and visibility to the data: who touched it, when did they touch it, why did they touch it, what happened to it. Provenance is really focused on one aspect of that, so we think these two things fit very nicely together, and we'll be investigating more deeply how to connect them in an effective fashion going forward. One thing we do know is that there's typically limited persistence capability on the nodes where NiFi may be running, so the idea, again for historical insights into both the data and the metadata captured by the data provenance capabilities, is that we may want to flow all of the information from the provenance repositories into an operational data store that lives inside of HDP and then use that for interesting reporting and analytics. Jamming every piece of event lineage into Atlas is unlikely at this point, but we'll go deeper in terms of putting these pieces together and providing a compelling integration going forward.

Next we want to look a little bit at security. The key point we want to get across here is that the security story is about a lot more than just encrypting communications. You want to be able to make really fine-grained and real-time decisions about whether a person or a system is authorized to have or see a piece of data, and that means you need to keep and use tags on that data, as well as information from within the data itself, and marry that up against authoritative sources at runtime, so you can make immediate decisions about whether that data can be delivered.
When you think about a real enterprise, these are things which are changing all the time, particularly in large data centers where you have a multitude of systems or new systems coming online. It's really important that you can honor the current entitlements and accesses of those systems, and that's something NiFi does from the outset; it's part of the reason we've taken pluggable authorization so seriously. One of the things we're doing from a Hortonworks perspective is talking through the key qualities of data security: administration, authentication, authorization, audit, and data protection, and NiFi comes out of the box with a bunch of capabilities across all of those dimensions. We like the consistency here in terms of describing the story, and we'll also look at means for extending these capabilities as we go forward.

One thing to keep in mind, and this is one of the questions we're seeing over the chat, is the relationship of HDF to Storm, Kafka, and Spark Streaming; we'll dive into that in a minute. The idea is that HDF will be focused on the data in motion and extracting those perishable insights while the data is in motion, and HDP, the data platform itself, backed by Apache Hadoop, will be used for long-term persistent storage and for those historical insights. Imagine, if you will, that you've got multiple NiFi nodes running out there, each of them collecting information from various sensors across the Internet of Anything. Independent analysis of individual events can be done, to some degree, on an individual NiFi node, but if you want to look at a more complex chain of events across the different sensors being delivered by each of the NiFi nodes, you're likely going to do that in a stream processing solution like Spark Streaming or Storm. That's really one of the fundamental differences: looking at individual events versus aggregating events across collections. That's one of the simplest examples we can give.

So I'm going to talk a bit about what's going on within the community for NiFi and the enhancements that are coming. I'll be super brief so we have at least thirty seconds to address quite a few questions. One thing people have been really consistent in describing is that they're looking for better configuration management of data flows. There is a multitude of teams across the enterprise managing dozens or more NiFi clusters, and if you look at large enterprises, that number gets even larger. Those teams want to share information better; they want to share extensions and templates and so on, so we're going to be tackling a lot of that. If you look at the link on the bottom of this page, you can go to the Apache NiFi community's wiki, where they talk about NiFi feature proposals; all of these are described there, and you can get a better sense of where we're at, what kind of JIRAs they tie to, and some of the mailing list discussions the community is having about where to take these things. Related to that is providing an extension and template registry, so you can have a single place in the enterprise where you store these things, and then connect to your NiFi instance and say: I want these extensions, I don't need those, but I would like to pull in this template; or, hey, I built this new geo-enrichment flow and I want to share it so other people can do geo enrichment.
Similarly, Avro is something we see used a lot, and a lot of people in the community have been actively asking for it, so we're building much better integration to make it easier to work with. We talk a lot about interactive command and control, but we want to take that even further: we want you to be able to literally watch data flow through the system step by step. Today you have to do it through the provenance mechanism, which is great, but even waiting that many seconds is sometimes not enough; people want to be able to watch data move from queue to queue. So we want to provide interactive queue management, so you can see exactly what's sitting in a queue, how the prioritization is working, and be able to immediately click through to content and attributes right in line. We have a really strong authorization model today, but what we don't have is a deep story for multi-tenant authorization, and what we mean by that is the same flow operated on by multiple organizations simultaneously, with their permissions separated; we're going to tackle that as well. We have a few other things we don't have time to get into now, but you can look at those feature proposals to get a better view.

So we're going to wrap up with the most popular question on the chat: Joe, tell me about the relationship between HDF, Apache NiFi, and Kafka. Yeah, so think of Kafka as a pure messaging broker: you have producers providing data to topics, and then an arbitrary number of consumers that can pull from them. It's a very important part of the data flow story, but it is essentially tuned for the space where you have a domain of agreeing systems that all understand a similar schema and format and are willing to use a specific protocol. With NiFi we're looking at the broader enterprise data flow management story, so these are two systems you'll see working side by side, collaboratively, for quite a long time; we actually see Kafka as being an important part of HDF itself as we go forward.

Okay, with that, we want to thank everybody for attending today's webcast. We appreciate all the questions; we tried to answer as many of those as possible in real time during the chat, and we appreciate the ratings and feedback you've already given. Look for forthcoming announcements related to the general availability of HDF and the related support subscription from Hortonworks. Thanks again, and have a great day.
Info
Channel: Hortonworks
Views: 23,291
Keywords: Dataflow, Hortonworks (Business Operation), Apache Hadoop (Software)
Id: fO-xOrWBZJU
Length: 61min 16sec (3676 seconds)
Published: Wed Oct 21 2015