End-to-End Machine Learning Pipelines for Python-Driven Organizations - Nick Harvey

Video Statistics and Information

Captions
I'm going to be talking about how you can build end-to-end machine learning pipelines in Pachyderm, but not in the way you'd typically expect. What I'm going to focus on today is how machine learning pipelines in general are missing a big thing, which is something called data provenance, which I'm pretty sure a lot of you are familiar with. Let me get started.

As data scientists, we love Data. You guys know who that is, right? I'm not that old, right? Okay, cool: that's Data from Star Trek, if you were wondering. There are my geek credentials, I just used a Star Trek reference in my presentation. More importantly, we love big data, and we love it when it gets combined with machine learning. These present pretty powerful options, especially for companies out there. (I'm just going to take this off the mic stand because I like to move around.) They're thinking: if I take this big pile of data that I've been collecting for years and apply it to this field of machine learning, not new, but new to us, all of a sudden I get these mind-blowing insights. I'm just going to hire a bunch of these people called data scientists, give them access to the data, and let them run.

At Pachyderm we deal with this a lot. Companies have put a lot of investment into machine learning, they've hired a bunch of data scientists, they've given them all this data, and they're kind of just sitting there at their big desks wondering, when am I going to be Amazon? When am I going to be one of these next big players? It's because when people jump in, they immediately go to these two steps: connect the data to the math, expect some sort of output, and then become Amazon or Google or OpenAI. Maybe we start to think about some inference and prediction if we're getting real fancy. But in actuality, and I think you understand this better than anyone, there are a lot more steps involved when we start thinking about a complete, holistic machine learning pipeline, and even those aren't all the steps; there are still things that you as a company may do differently at different stages.

One of the biggest challenges we see is that in order for machine learning to really reach its full potential, three major things need to happen. One: data itself needs to have the same production practices as code. If you think about the last ten or fifteen years in software development, code has gone through a tremendous amount of change in how we deploy it and how we use it, not just the languages and frameworks that were created, but how we move from one version of that code to another. I'm specifically talking about Git: the ability to version control, roll things back, and integrate and test on very specific versions of code is what has made production software possible as it is today. But data hasn't inherited those same principles. We've moved the needle forward, but data has stayed in that same static place: it's something uploaded to an object store, it lives in a data lake, but there is absolutely no versioning. It's a highly protected resource, and for good reason, but we just throw it at things and expect an output, and when we want to explain what happened, it becomes very, very difficult.
The second thing: developers need to be empowered, not restricted. We need to give data scientists and developers the ability to work with this data, but without this impending sense of doom of "if I mess anything up, it's going to destroy everything, that last known good state of the data is gone, and I have to email someone to roll back a backup, which is going to wake them up at 3:00 in the morning." They need to be empowered to experiment freely, but also to move quickly and use the right languages and frameworks for that project, not whatever everyone has always used.

And finally, organization-wide confidence. This is personally my most important point when I think about applied machine learning in the enterprise today, especially if we start thinking about responsible machine learning, AI, and data science in general. When we have data and we use algorithms to gain powerful new insights from it, we need to be able to explain those outcomes at any given point, and explain them clearly and quickly. Think critically about a specific use case: someone is applying for a bank loan through the website and puts in all their information, but behind the scenes the bank has invested a lot of money into training a model for what is a good candidate for a loan and what isn't. If we don't take a look at how we trained that model, we create bias, and I'm pretty sure you're all well aware of AI and ML bias. The thing is, especially with regulations like GDPR, if someone says, "wait a minute, why was I rejected? I want specific details as to why my loan was rejected," and the company is relying on a machine learning model to make that decision, explaining that outcome becomes very difficult, if not sometimes impossible.

At Pachyderm, where I work to help communities of developers leverage the platform, these are the big things we want to solve. I'm going to show you how we solve them and show you our platform a little bit, but mostly give you concepts around best practice and all the characteristics an end-to-end machine learning pipeline really should have to be successful.

So what are some of the biggest obstacles when we think about end-to-end machine learning pipelines? There are a lot, and these are just a summary of the big ones; I'm sure you have your own. One is data divergence. We can sum this up as: if you change anything, you change everything. Keeping track of all of those changes is next to impossible without some sort of provenance, without some sort of data lineage in play. And if you're making decisions from this, especially when models are being trained autonomously, the decisions and the history of those decisions get lost in clouds of noise very quickly.

Then of course there are tool chains: I use Jupyter notebooks, I use Python directly; well, you guys are noobs, I use C++, or whatever. There's this big argument over which language is better. I was talking with a fellow at lunch about whether Jupyter notebooks are worthy of production or not, and we went back and forth; it wasn't that we took completely opposite stances, we just had different approaches to things.
If we want to have teams of data scientists working together to solve a single problem, we really need to make sure they're empowered to use the tools they know best, or at least the tools that are right for the project. Just because the rest of the application is written in Java doesn't mean we need to introduce Java into this; it just needs to be the language or framework that works for the problem at hand.

Then, most importantly, coming back to that provenance idea: everything needs to be reproducible. We're data scientists; we are scientists. Most machine learning and AI work happening today is still missing that reproducibility. One of my favorite GitHub repos that I follow is "machine learning papers with code": it's just a list of machine learning papers that also include code, and it exists because we're publishing all this great material but nobody's showing how to reproduce it and recreate the results. That's pretty important, and one of Pachyderm's main goals is making results immediately reproducible; I'll show you what that means in a second.

To sum up that last slide: if data needs the same production practices as code, that's version control for data, essentially Git for data. If we need developers and scientists to be empowered, that means containers and containerized data pipelines, which becomes really powerful in a second, as I'll also show you. And then it's being able to instantly reconstruct any past output or decision. That's imperative, that's boilerplate, that's basic science: if we're using math to prove something, we need to be able to show our work. That's data lineage. These are all things Pachyderm focuses on providing from an open-source standpoint. I'll skip this slide because it's more business-oriented; we want to get down to the details.

So, data versioning: what does that mean? It means the ability to identify and revert any bad data changes. We actually just released version 1.8, and one of its features is being able to take structured data and automatically split it up so it can be distributed across a cluster. That sounds useful on its own, but what it really does for provenance and versioning is this: what was a single database dump or a single structured file, when it gets committed and split in a Pachyderm repo, becomes a set of per-row files that inherit our versioning and tracking system. Everything gets a commit ID, so what was previously one big commit or one big file can now be split into chunks where every single chunk has a commit ID and can be traced back completely. So if you run an ETL pipeline and do all these things to the data to shape it and prepare it for a given outcome, you can immediately recall exactly what it was that you inserted. If you look at the graphic on the right, one of these things is not like the others: two of them drive on the road, one's a boat in the water. Data gets injected sometimes that's not like the rest, and we need to remove it, and remove it quickly, without affecting the stability of everything else. That becomes really powerful, it's one of the things Pachyderm provides, and I'll show you an example of it.
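As a rough, hedged sketch of what that split-and-revert flow can look like from the pachctl CLI (the repo name, file name, and per-row file path here are made up for illustration, and the exact subcommand spelling and split flags vary between Pachyderm releases; 1.8-era commands were hyphenated like this):

    # Create a repo and commit a structured dump, asking Pachyderm to split it
    # so that every row becomes its own tracked file with its own history.
    pachctl create-repo users
    pachctl put-file users master -f users.csv --split line

    # Every row now shows up as an individual file under the commit.
    pachctl list-file users master

    # Removing one bad row is just a new commit; the earlier state stays
    # recoverable in the commit history, and downstream pipelines re-run
    # against the corrected data. (The per-row file name is illustrative.)
    pachctl delete-file users master users.csv/0000000000000000
    pachctl list-commit users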
The containerized pipeline part I can summarize really quickly: Docker containers, more specifically Docker containers running on Kubernetes. This is how we accomplish giving developers the freedom to use the tools they need. It's also great for ops: if you have an ops team that constantly pushes back with "no, I'm not going to install that because it means a lot of work for me," ask them whether they support Docker; most of the time they say yes, or that they're evaluating it. So Pachyderm leverages containers and, equally importantly, Kubernetes, which gives us the ability to scale, leverage things like GPUs, and run pretty much anywhere: on-prem, off-prem, in the cloud, hybrid cloud.

More importantly for you, I think, what these containerized pipelines really allow us to do is take the whole pipeline, all those steps I showed you earlier, and instead of thinking of them as things stacked on top of one another, think of them as stages strung together where everything is independent, so you can follow the individual output of each one. One example: I was working with a customer who does genomic data science. They had samples coming in from the field from different areas of the world, and for each individual bench scientist they had to do their own ETL or refactoring of that sample data just to get it to that scientist. For the computational team running all of this, everything was ad hoc and independent, there weren't any reproducible steps, and that created a lot of challenges. The company was hiring really brilliant individuals to basically manage scripts all day, which was frustrating. With Pachyderm, they were able to declare steps independently of one another and then tie them together, so where behaviors are similar but not quite the same, they can define them as different pipelines and chain those pipelines together.

Another big thing, and I put this puzzle piece over here, is that now that data inherits a lot of those production practices from code, we can start doing things like CI/CD in a more real, tangible way, because we're testing data just like we test code. One example of a pseudo end-to-end machine learning pipeline is a pull request I actually have open in Kubeflow right now. What it shows is how you can take data, pre-process it, version it, feed it into your training, and then leverage Kubeflow and Seldon not only to run that on actual hardware and distribute those workloads, but also to serve the model out. That builds on the previous slide: every aspect of your pipeline is defined by you, and you can use the tools and services that make sense. If Seldon is something you use to serve models out, it's super simple to implement; it's just adding another pipeline spec to your overall pipeline.
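To make that chaining idea concrete, here is a minimal, hedged sketch of two pipeline specs; the names, images, and commands are hypothetical, and exact field spellings can vary by Pachyderm version. Each pipeline writes to an output repo that carries the pipeline's name, so the second spec simply names the first pipeline's output repo as its input:

    {
      "pipeline": { "name": "preprocess" },
      "input": { "pfs": { "repo": "raw-samples", "glob": "/*" } },
      "transform": {
        "image": "example/preprocess:latest",
        "cmd": ["python3", "/preprocess.py", "/pfs/raw-samples", "/pfs/out"]
      }
    }

    {
      "pipeline": { "name": "train" },
      "input": { "pfs": { "repo": "preprocess", "glob": "/" } },
      "transform": {
        "image": "example/train:latest",
        "cmd": ["python3", "/train.py", "/pfs/preprocess", "/pfs/out"]
      }
    }

Because the "train" stage reads from the "preprocess" output repo, a new commit to raw-samples ripples through both stages while provenance is tracked at each step.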
It becomes very simple to chain these things together and maintain provenance and lineage throughout the whole thing. Provenance, which I've mentioned several times, is tracking every version of that data and code. If we input something, output something, or change it anywhere in between, everything is tracked and everything gets a commit ID, so we can trace it all the way back and ask: where did this thing start, and how did it end up this way? I don't know about you, but that has helped me a lot with some of my own models, simply because data gets restructured or changed at one point, moved from one data frame to another, things get merged around, and then I'm sitting there for four hours wondering why the result is totally wrong. It was because at one point I made this change to my data frame and it messed everything up, and I wasn't able to track that down without some sort of provenance. That's why Pachyderm has made it so much easier for me to do those things.

A quick stack diagram: as I said, we run on top of Kubernetes; the Pachyderm engine lives on top of that, along with the Pachyderm file system, which is not the Hadoop file system but shares some of the same concepts: basically a distributed file system across a large cluster. That's all our open-source stuff. We have some enterprise features, dashboards, statistics, and access controls, the things ops teams tend to care about more, which I think are less interesting to this audience. Realistically, everything I've shown you today you can use completely in our open-source engine, and it's really powerful and really interesting.

So let's go through this in action, and then I'll go to a demo where we can dive into actual code and examples, which is what I think you want to see. But first I think it's important to think critically about what we mean when we say data provenance and data lineage and the impact that has; hopefully in the back of your mind you're thinking about how you're tackling that in your own workflows, regardless of Pachyderm. Think about GDPR; I think we all have at least some familiarity with that acronym. It basically gives European citizens a bunch of rights over how their data is used and how it's protected. One of those is the right to an explanation: if you get an outcome, you have the right to a full, human-interpretable account of what happened and why it happened that way. But if companies are leveraging machine learning, there's still a lot of black box in there. Think about it: we have a bunch of users in a database, and we train a machine learning model to say yes or no on whether someone gets a bank loan. Or, not to throw shade at Amazon, but they wrote a big article about their hiring AI system, which they shut down because it was biased against women for a similar reason: they threw a bunch of data at it, but they didn't really understand what data was being used to train that model.
Once they did a really deep analysis, they understood that the training data had a lot of masculine language in it, so candidates who had more feminine language in their resumes were getting bumped to the bottom. They didn't find that out until way too late, and millions of dollars had been invested. The GDPR concept is the same: we have this database of users or content, we train a model to make a decision, we deploy it and hope everything goes fine, fingers crossed, nobody throws a fit. But with GDPR, users have a legally backed way to get an explanation of that outcome. If you don't have some sort of provenance or lineage, or both, in terms of technology as well as business process, tracing back to why that decision was made is a very arduous process and sometimes impossible, which can create real problems.

With Pachyderm it starts to take a little bit of a different shape. As I said, everything you put into a repo in Pachyderm gets a commit ID; think of it exactly the same way as checking code into Git. When you check in new code it gets a commit ID, it's in a repo, you can deploy your containers to train your models, and then you can serve that model out, with Seldon or whatever you want to use. Then we get to the point where Jared asks, "wait, why was I not approved?" Or, in this specific case, "I don't want my information used to create automated systems," which is another GDPR right: you can say you don't want your data used in any kind of automated training. That's a really simple problem to solve here: Jared raises his hand, says "I don't want my data used for that," and with pachctl you just delete that file, and the model gets retrained on the rest of the data set. We don't have to call someone to open up a database, find that user record, delete it, re-dump everything, and then try to push it all the way through the pipeline again; all of that becomes automated. It becomes pretty powerful when you start thinking about not just deletes, but edits or additions or whatever.

So, this is Pachyderm in 60 seconds, because I really want to get to the good stuff but have to go through these slides to cover some concepts. Pachyderm is an enterprise-grade data science platform that lets you build multi-stage pipelines that are language-agnostic and maintain complete reproducibility of code and data throughout the entire thing. It's version control for data, it's containerized data pipelines, and then it's the concept of data provenance on top. Is everyone okay? Give me a thumbs up if there are any questions. I'm seeing a couple of thumbs up, so I'm going to move on. All right, I can't do this while holding a mic, so I'm going to set it down and try to stay still. Let's open this up real quick; enhance, enhance, enhance. Let me go to our dashboard, because that's an easy place to start.
This is my DAG. These blue hexagons are repos; think of them similarly to, or exactly the same as, a Git repo. These green icons are my pipeline specs: the instruction sets I'm giving to Pachyderm. We'll take a look at each one, but from the DAG you can see that in my images repo I have a picture of an airplane, and it's just an airplane, nothing special. You can see there's one file in there, it was done in one commit, it's on the master branch, and there's some really interesting information here. I also have this model repo that holds a pre-trained model from TensorFlow; I didn't want to train a whole model live, otherwise we'd be here until Thanksgiving and miss out on delicious turkey, so I have this pre-trained model.

You can look at the pipeline specs, which I think are easier to open in the command line. If I open the model one, you can see my Pachyderm pipeline spec; it's incredibly simple, it's just a JSON file. I give it a name, I point it at a repo, and I point it at this frozen inference graph; I'm just telling Pachyderm to look at that repo, make the frozen inference graph available, and output something. The detect pipeline is a little more detailed, but still pretty simple: I'm giving it inputs, and in Pachyderm you can give multiple input sources combined in different ways, like joins or crosses, which I'll save for another time. Basically I'm saying: look at the images repo, look at the model repo, and do the object detection, which is a Python inference script, and output the result. That's the basic outline of these stages. The training stage is more symbolic than actual reality here, because again we're using that frozen inference graph, but you can see how all these steps are defined and tagged.
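As a hedged sketch of roughly what a detect spec like that can look like (the repo names images and model come from the demo; the image name, command, and exact field spellings are assumptions and can differ between Pachyderm versions), a cross of the two repos might be declared like this and registered with pachctl create-pipeline:

    {
      "pipeline": { "name": "detect" },
      "input": {
        "cross": [
          { "pfs": { "repo": "images", "glob": "/*" } },
          { "pfs": { "repo": "model",  "glob": "/" } }
        ]
      },
      "transform": {
        "image": "example/object-detect:latest",
        "cmd": ["python3", "/detect.py", "/pfs/images", "/pfs/model", "/pfs/out"]
      }
    }

The cross pairs each image with the frozen inference graph from the model repo, and whatever the script writes to /pfs/out lands in the detect output repo with full provenance back to both inputs.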
So let's actually put a file in there. If I go into the images repo and do a pachctl put-file, not an airplane this time but dogs, because who doesn't love dogs, dogs.jpg, boom, and then try to catch it in flight; it goes pretty fast. Okay, that's already gone through, and we can see in the detect output that dogs has already shown up. I should have shown this before: that airplane picture we input earlier went through the object detection in TensorFlow, and the output has the detections drawn on it. The airplane was in there to start; I just put in the dog, and look at that, some adorable beagles. We can do this all day long.

Real quick: if I do pachctl list-repo, you can see the different repos I have. If I do pachctl list-file on images master, you can see my different files in there. Then I can inspect things; we have a lot of inspect commands, but if I do inspect-file on detect master dogs.jpg, I can pull out details here. Let me see if I can show you something else that's pretty cool and catch one as it runs; let me try to put kites in there, if I type quickly enough. I didn't type quick enough, oh well, we'll see it in the history. You can't type when someone's looking at you, as the old saying goes, and I have fifty-plus faces looking at me.

So: every time I put an image into that input repo, the images repo, Pachyderm noticed the change and created a job defined by my pipeline spec. The spec says take images from here, use this inference graph to do object detection, and put the results in this output repo. You can see every state is tracked, and everything has a commit ID; every time something went through the image detection, it created a commit. If I copy one of these commit IDs and inspect it, you can see I have a commit ID here and also a parent ID, and if I go back and trace that, you can see how it went through those different stages, with a new child and parent assigned at every stage and change. So if I wanted to figure out what model was used to detect this image of a dog, I can immediately follow that all the way back, and you can see the provenance is tracked here as well: it came from images, and its outputs and parents are all located here. Provenance is never lost, even as data moves between repos and goes through different changes.

I spoke very fast through the slides and rushed to the demo because of time, but I want to open it up for questions; throw stuff at me and let's explore. Yes, you went up first, go for it. [Question] His question, in case you couldn't hear it without the mic, was: I did a put-file to insert that file into a repository and kick off all the pipeline work, but can that be automated from a user request, some sort of webhook, or cron, something like that? Yes. If we go to our developer documentation, and I'll blow this up, and look at the pipeline spec, there are a lot of different ways to do that. You can initiate it from a cron input, or you can use Git, and there's some webhook support as well. A lot of what people do is either time it via cron so it runs on a schedule, or watch a repo so that every time it changes, the job runs; or you can work it the opposite way, where your user request, as part of that JSON or whatever, tells it to update a Pachyderm repo, and once Pachyderm recognizes that the data set or repo has a new commit, it kicks the job off again. Any other questions?
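As a hedged sketch of the cron option being described (the pipeline name, schedule, image, and command are made up, and the exact cron-input fields may differ slightly by Pachyderm version), a pipeline can be driven by a clock tick instead of a data commit roughly like this:

    {
      "pipeline": { "name": "nightly-fetch" },
      "input": { "cron": { "name": "tick", "spec": "0 2 * * *" } },
      "transform": {
        "image": "example/fetch:latest",
        "cmd": ["python3", "/fetch_latest.py", "/pfs/out"]
      }
    }

Here the cron input commits a tick on the given schedule (a nightly 2:00 a.m. run in this example), and that commit triggers the job exactly as a put-file would.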
[Question] This may be my misunderstanding of GDPR, but when you were going through the process flow of removing a user's data, I was curious whether you rebase or delete the history. If you rebase, you still have the user's data in that repo, whereas if you destroy the history, that could force recomputing all the hashes and potentially mess things up. [Answer] You actually caught me on that, thank you. Yes, he's absolutely correct: legally, under GDPR, when someone requests to be deleted, and I didn't bring that case up specifically, I'll have to change my slides, everything about their data has to be deleted. There is no history of this person; they become a ghost. In Pachyderm we can remove that person from the repo so they won't be used to train any future model. To fully delete the data, you'd also have to go to the database where that user lives and delete it there, and to really go all the way through it you'd have to figure out every commit where that user's data was used, which isn't necessarily difficult, just an extra step, and then remove it from the commit history, which would be like a rebase.

What I was trying to show is this: if I do pachctl delete-file on images master dogs.jpg, I can delete that file, and if I then do a pachctl list-file on images master... it's still there, what did I do? Ah, there you go; again, you can't type when someone's looking over your shoulder. Thank you for catching that. You can see the file is gone, but realistically it would still exist in the commit history and all that other stuff. So you would go to the database where this user's information lives, delete it there, kick off your pipeline after removing the file, and the pipeline would rebuild and give you a new model trained without that user's data. However, I've had this conversation with some of our co-founders in depth: GDPR is a good example, and we are working towards an actual delete, so that when we delete, we go back and make sure the hash trees and everything no longer contain that data. That's something we're absolutely working towards. But as a simpler example, you can remove files from pipelines, which also satisfies part of GDPR: you can't use that user's data to train automated systems. Hopefully that answers your question; good catch.
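Pulling that exchange together into a hedged sketch (the repo and file names are hypothetical, and the hyphenated pachctl spelling is version-dependent), the "remove and retrain" flow being described looks roughly like:

    # Remove the user's record from the input repo; this creates a new commit.
    pachctl delete-file users master jared.json

    # Any pipeline that takes the users repo as input sees the new commit and
    # automatically re-runs, producing a model trained without that record.
    pachctl list-job

    # Earlier commits still exist, which is why a true GDPR erasure also means
    # deleting the record at its source and scrubbing the commit history.
    pachctl list-commit users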
Are there any other questions? There's one over here; if you're in the back you're just going to have to holler, the lights are pretty bright. [Question] ...say you want the job to run at 2:00 in the morning? [Answer] Yes, we can leverage cron to do that. Say we want this job to run nightly at 2:00 a.m.: part of your pipeline spec can be a cron input that says go fetch the latest set of data, and then it kicks off the job and runs it, and you can set that cron schedule to be however often you want. What's cool is that because in Pachyderm everything is a Docker image connected via a pipeline spec to a repo of data with all this commit history, piecing things together and being very declarative about when things run, how they run, and what output comes from what is kind of the whole purpose of it. And then of course there's scale: in our recent 1.8 announcement, which was, let's see, today is Sunday, so Thursday or Friday, we upped our scale to the petabyte level, so you can scale to hundreds of millions of files and thousands of nodes in a cluster, and the biggest of enterprise loads can happen completely in Pachyderm. While Pachyderm is initiating it, it's really Kubernetes that takes that load and spreads it across the cluster, and you can leverage GPUs in there as well. Sorry, I went a little past your question, but yes, you can be very declarative all the way down from when your job runs to what resources it uses. Hopefully that answers your question.

[Question] Are you compressing data at all? [Answer] No. So it's just a massive amount of raw data, and it's up to the user to compress it? Yeah; compressing data could be a step in your pipeline if you wanted, because again it's all about how you declare things. If you have a step in your pipeline that compresses data, you can do that, but by default, to my knowledge, and I could be wrong, I'm on record here and someone may correct me, we don't compress any data outright; it's not a default behavior. But if that was part of your pipeline, you could do it. Any others? I'm trying to keep an eye on the back.

All right. Usually when I go to talks like these, people leave it at a demo and don't really tell you where to go afterwards if you want to play with it at home. We're a bunch of open-source nutcases at our company, so we do pretty much everything on GitHub, and we have a Slack channel that's open for users to join. I'll pull up our website real quick, not to promote the product, but so you can see where to get involved and ask questions without feeling pressed: there's a Slack channel you can join and ask any question you want, no matter how simple or difficult; our GitHub is another great place to find information, see what we do, and join discussions. With every release we're also starting to put out more blog posts giving insight into the back end of why we changed things the way we did, at an engineering level. If you want to see how we went from the scale we were at previously to the scale we're at now, there's a great blog post from the engineer who did all that work explaining why and how things changed at an architectural level; I'm not that smart, so I'm not going to try to repeat it, it would sound all kinds of broken coming from me, but it's a very cool post, and we're all about being transparent and sharing information. We also have a lot of really good documentation and examples covering many different use cases: database stuff, genomics pipelines, Jupyter notebooks, some ML examples; the one you saw today is the object detection example, so you can follow it yourself. There are lots of good ways to get your hands on it and see how you can take any kind of data you have and, within a few seconds, it takes about two minutes to deploy Pachyderm, your data immediately starts inheriting this lineage and provenance that you can use.
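For anyone wanting to try that at home, a hedged sketch of a local deployment from that era (deployment command names have changed in later releases, so treat this as illustrative rather than current):

    # With a local Kubernetes cluster (e.g. minikube) already running:
    pachctl deploy local        # deploy Pachyderm onto the cluster
    pachctl version             # confirm pachctl can reach the daemon
    pachctl create-repo images  # data committed to repos is versioned from here on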
Having that lineage and provenance gives your models much more of a scientific approach, which is incredibly valuable on multiple fronts, and as far as I know, we're the only company providing data provenance at this kind of scale. So check out our GitHub page, jump on Slack if you want to ask questions, and that's really it, unless you have any other questions; I highly encourage them, there are no dumb questions except the ones that don't get asked. Oh, there's another one, yes.

[Question] When this pipeline is in production and used by a team with tons of images, tons of data, and you pass it into a different stage, into prediction, at large batches, is that already easily taken care of by the pipeline itself? How well have you seen this perform when the data coming in is really large, off-the-scale huge? [Answer] This is Pachyderm, but even more so I think the beauty of Kubernetes. I was really excited about the split functionality we shipped with 1.8, and I have a blog post out there about it that I could show you quickly. What we can do is take data in a single structured file and split it up into however many files, and you can specify how to split it. By default Pachyderm will take a table of rows and make each row its own independent file, storing the header information with that file, so it'll create as many files as you have rows; but you can also tell Pachyderm that, instead of an individual file for each row, you want every ten rows to be an individual file. How that applies to large data is this: whether it's structured or unstructured or whatever, it's all object storage to begin with, so moving things from one object store to another is really going to depend on speed. But as far as Pachyderm ingesting large amounts of data, it's as simple as a put command, and it goes and either breaks the data out or stores it. Each independent job is spun up as a Docker container to do its work and then spun down, so it's really Kubernetes that shines in that regard. But yes, that scale is there.

All right, I think that's it, I'm not seeing any hands. Going once, going twice. I'm going to be sticking around, my flight doesn't leave until later, so if you have any questions you didn't want to ask in the spotlight, feel free. Thank you for your time, and thank you for everything. [Applause]
Info
Channel: PyData
Views: 12,211
Rating: 4.84 out of 5
Id: DGeVRD63xZw
Length: 38min 43sec (2323 seconds)
Published: Thu Jan 03 2019