DC_THURS on Dask and Coiled w/ Matt Rocklin

Captions

Welcome back to DC Thursdays. It's another Thursday and we have another awesome guest on the show, whom I'll introduce in just a minute. I'm Pete Soderling, the founder of Data Council and Data Community Fund, as many of you know, and I'm here to guide you through the wild west of the data ecosystem. We've been fortunate to have on lots of awesome founders and lots of awesome open source project leaders and authors. Today we have someone who floats across both categories, which will make for a great conversation.

Before we get started and introduce our guest, I want to thank our sponsor IBM, who's been supporting Data Council and the DC Thursday show. We're super excited and very grateful for their support. They have an online community for data scientists that you should check out; we'll put the link in the chat to the right of the video. Make sure you check out what IBM is doing for the community. They're believers in open source: they have NLP projects, they have ethical AI projects, they contribute to the Python community. We hope you'll check them out and tell them that you're grateful for their support as well.

As we do every week, we're going to open the chat to questions, so feel free to jump in at any time if you have questions for our guest. We want to make this a cross between a podcast and a call-in radio show, so the best of all generations is right here at your fingertips. Feel free to sound off in the chat and we'll do our best to get to your questions.

Without further ado, I want to introduce Matt Rocklin. Matt is the co-founder and CEO of Coiled Computing, which is the company behind Dask. Prior to that, Matt worked at NVIDIA and at Continuum Analytics. He's obviously the author of Dask, which we'll get into, and which is a fascinating story.

There are lots of Dask authors; there's a large community. If I don't say that, everyone else will jump on me.

Sharing the credit from
the start, which is great. Thanks for jumping in; that's well said. Before kicking off, or having something to do with kicking off, the Dask movement, Matt was a researcher at Sandia National Laboratories. He has a PhD in computer science from the University of Chicago and he also has a degree from UC Berkeley. So Matt, those are quite the credentials. Thanks for making time to chat with us.

Oh, that's great. I love doing these things. My story is like a small version of what, generally speaking, Python has been through over the last couple of decades, and it's been a wild and unexpected run, personally and from the community perspective.

We were chatting about that a little bit before we came online, and we'll get into some ecosystem discussions about how Python has tracked over time. Before we do that, I want to understand the earliest days of your career leading up to your work on Dask. What were those days like? You studied math, you have a CS degree; how did you merge those things into this interest in data? When did data start to become a thing for you? Did it become a thing for you, or are you just a distributed systems guy who happens to be playing in the Python world?

Actually, quite the opposite. I started in undergrad doing physics and math and astronomy, so I very much came from more of a scientific practitioner's background. But as a kid I had a little TI calculator that I would play with a lot, and so I knew how to program, and it was odd how much of an advantage that gave me back in 2000 when working on technology problems or science problems. I was never a very good scientist or a very good researcher, but I was an order of magnitude more effective at enabling my peers to do their research by helping them work through computational
problems. That very quickly became a strong inducement to not do the science itself but to support scientists with computation. I ended up falling into a career at the U.S. Department of Energy national labs, because that's a good mix of practical problems and technical problems. But while doing so I was also working on the side in open source and in Python, and that just had an order of magnitude more feedback, and I was having a much bigger impact. So pretty quickly after being at Sandia I decided to jump ship and joined Continuum, as it was at the time, which then became Anaconda. It's where pretty much all of the open source for-profit people went at that time, and it was a really exciting time: for anybody who wanted to have a big impact in open source Python, doing science and data science, that was where you went. So that was my transformation: first from science to computing, and then from academic-ish work to commercial open source companies, which were very new at the time. It was still quite strange.

Great. And so it was around 2015 that you dropped the paper on Dask, if I'm not mistaken. Is that right?

Yeah, a lot of things happened around that year. I was brought in at Anaconda to think about making Python faster. I had done a lot of open source work, and they wanted to attract people who could make Python more attractive generally; as Python goes, so goes Anaconda. Anaconda was concerned at the time about Spark, which was gaining a lot of traction. Spark was this other data processing ecosystem apart from Python, and there was some existential concern. So I was thinking a lot about how to take the existing Python ecosystem and parallelize it. How do we take NumPy and pandas and scikit-learn, and what eventually became PyTorch and other tools, and scale those out to many machines? That
became lots of projects at first, but eventually Dask.

And so there's this notion of blocked algorithms. I don't want to get too much into the implementation details, but are blocked algorithms used both in Dask and in Spark? Is this the general category of processing that we're considering?

Let me think about that for a second. Dask does many things. People sometimes ask, "What does Dask do?" and it does lots of stuff, so I want to avoid putting it into a box. But let me answer a slightly different question. With Dask, we didn't want to do what Spark did. Spark created an entirely new data framework to solve data processing problems. With Dask, we had the Python frameworks already around: we had NumPy, we had pandas. So the goal was: if I have a really big data set, can I just have 50 machines, each of which is using plain pandas, and reuse a lot of that same code? A lot of Dask algorithms early on were about pretending you had one big pandas data frame when you actually had 50 pandas data frames, and then building parallel algorithms around them that reuse the existing ecosystem, the existing code we already had from decades of development. So that's a case where Dask and Spark differ a bit, and that brings up this idea of blocked algorithms: how do we reuse what we have to gain something new, to up-level our existing workflows?

I love the framing of how you walk through that in the intro to the paper, because you explicitly ask: why would we throw away all this work, all these thousands of man-hours of engineering time and research in the Python ecosystem that have been battle-tested and tried and true, just because we can't run these things on multi-core machines? Dask was positioned literally as the infrastructure to bridge that gap, and I think it was really well said in the intro to the paper.

Yeah, it's a super
interesting, super valuable, but also really hard problem. It's not just a technology problem. You've got to get into a bunch of different code bases; you need to get into a bunch of different communities. There was a social engineering, cultural problem to address at the same time as building the technology. It had to be really unopinionated and very receptive in order to have the kind of impact it's had, and not just with pandas. pandas is maybe twenty percent of Dask usage. There are a ton of other libraries that leverage Dask today, where Dask was put in after the fact. It's like a turbocharger for your car, something you add in.

Well then, before we get into more specifics on what Dask is and what Dask does, those kinds of features: since you've placed Dask against this backdrop of the Python ecosystem at large, explain to us what was going on in the Python ecosystem at that time. It was still mostly used for academic and scientific research purposes, no? Google has obviously been a big Python production user since the early to mid 2000s, but what else? How do you frame the industry-versus-research curve of Python adoption?

We could write a book about that topic; I'll try to constrain my answers. Python started off as an academic, educational language. It was designed to be easy to learn. It wasn't designed for data science, it wasn't designed for web programming; it was designed to be easy to learn in the classroom. That ended up being really useful for growth; education tends to be a great way to get high growth if you're willing to wait 10 or 20 years. Python then went into web development and sysops, so that's where Google was using it. A lot of web developers used Python, still not in data science. Then I think a lot of the MATLAB community shifted over. Python had really, really good
connections to C and to Fortran, so it was very easy to take some of these existing scientific libraries, amazingly fast, battle-hardened code that's been around since the 70s and, you know, runs your local nuclear power plant, and make it really accessible. That combination of speed and accessibility ended up speaking really well to scientists at the time, who were semi-technical but not computer scientists, and who wanted speed. Over time, that audience ended up also being a really good proxy for data scientists. So Python built out its infrastructure serving the scientific need, and that gave it a head start on the same kinds of computational accessibility needs that data scientists and data engineers had as well. That's why I think we've seen Python suddenly leapfrog into business: it has all of this very interesting history with science. As an interesting artifact, a lot of the people in charge of the scientific Python libraries have a science background. Jupyter was started by neuroscientists and quantum physicists, scikit-learn also by neuroscientists. There's this weird history for everyone who's my age in the ecosystem.

And conversely, from a traditional CS perspective, Python programmers were always kind of considered to be hacks, right?

Yeah, yeah.

It didn't come from the hardcore computer science end of the spectrum.

Yeah, I'm actually very atypical in that I do have a computer science degree. I'm the weird guy who actually thinks about distributed systems, but I can also communicate with everyone else in the community. I think Python has always been a very pragmatic language, which distinguishes it from, say, Scala, which was very theoretically pure. And again, that pragmatism really spoke to the eventual need in
business. If you look at the history, there was NumPy and SciPy and scikit-learn, and then pandas shows up and brings in finance, psychology, machine learning. The deep learning frameworks came in a few years ago, and that brings in a lot of heavier groups. There's been this transition away from science into industry over the last five to ten years, depending on how you count it.

And then, as you mentioned, underneath that there's this big underpinning of mindshare around Python, because so many people had bumped into it, or had a brush with it, or learned it in school. So there's this quiet groundswell of all these folks who had experienced Python in some educational computer class or whatever, and that was probably also meaningful in a very broad sense.

Yeah. Also, the fact that there is a web stack attached to Python is amazing. You can build your machine learning model and host it in the same language, and it's actually quite hard to do that in most other languages. There probably is a Flask for Scala, but it's not something that I'm aware of. The fact that both of those are around is quite convenient in more of a production setting.

So your vision was to embrace the reality of increasing adoption of Python, especially in the scientific computing stack, and produce a more advanced system that could parallelize workloads and do it in a similar ecosystem, using libraries that data scientists were already using in Python.

Yeah. Well, my vision, but that's our vision generally.

So what happened... yeah, go ahead.

People love using Python on a single machine. They run into a big data set and there's this big "well, what do I do now?" question. Dask more or less lets them continue doing what they're currently doing, with minimal pain to scale out, rather than having to rewrite in some
other framework: MPI, Spark, custom queues, etc.

And what happened? I'm curious: when the paper was released, was it obvious that you were onto something? What was going on in your mind as the community started to respond to this new info and this proposal?

It's funny you mention the paper; I think no one reads that paper. It was written as part of a conference, the SciPy conference, which all the Python maintainers go to every year. Honestly, the biggest thing that advocated for Dask was my personal blog. I live-blogged the development: every few days during development I would talk about what I was working on, and that transparency brought a lot of other library authors into the project really early. We had groups like scikit-image or xarray or pandas; those maintainers started getting involved really early because they saw what we were doing. We were also very able to shift development a lot. Dask was very nimble; we pivoted quite a bit in the first few months as we were getting used to working with these other library developers. I would say within a small number of months, two or three, we had excellent traction among other core authors, and then a few months after that we had good traction with their users, and that really caused our growth to increase. So Dask ends up being more of an infrastructural player than an end-user tool.

So just so I understand: is this to say that you were not just building an open source project and soliciting community contributions, et cetera, but you were also blogging in English about design decisions and architectural choices? You were showing your work, your mental work if you will, via blog posts as you were also contributing code and pushing the project forward from a
bytes perspective?

Yeah, absolutely. I come from an academic perspective, and I think a lot better if I'm writing about what I do. Dask started very much in a bottom-up way. We made a little task scheduler. We then tried to do NumPy around that, we then tried to do pandas around that, we then tried to do lists around that, and all of those were little blog posts. We actually pulled all those things out of my blog; they all now live at blog.dask.org. If you go to that webpage you can see, down to the very first week of development (it was, I think, Christmas week; I was off on vacation), the very first thoughts that were going through our heads as we were working on it.

That's great. That's a really amazing story of the power of building something in the open and showing your work to the community.

And again, I think that comes from a lot of Python's principles of being more academic, a little more honest, a little more transparent, community-minded. I think what has caused a lot of Dask's great success is that strong transparency and community openness.

Awesome. So let's get into Dask a little bit more deeply.

Favorite topic. Now we can talk for hours.

So what are the main uses of Dask? Obviously Dask is parallel computing for the Python ecosystem. Maybe you describe it differently; I would love to hear, in your own words, what it is and what it's used for.

I think that's the correct description. People ask, "Where is Dask used?" My response is, "Where is Python used?" and it's more or less the same set of situations. Maybe I'll answer that with an anecdote.

Please.

The original goal of Dask was to parallelize NumPy, NumPy being this multi-dimensional gridded data structure that's really common. It backs more or less all of the Python ecosystem today for computation. The
idea was: right, we're going to accelerate that and then everything else will follow. That ended up not being the case. What ended up happening is we built that, we then built pandas on Dask, and then we showed that to users. About half of them said, "Great, this is what I want. I just wanted big pandas, I just wanted a big NumPy. This is perfect." Then about half of them said, "Actually, I don't really want big pandas. I'm doing something way more complex than that; I'm building out my own custom thing. But the engine that you had to build in order to parallelize NumPy and pandas, that engine is really useful to me." It is as though we had built a nice car and people said, "Great, I'll take the engine, please. I'm building a rocket ship," or, "I'm building a submarine." This speaks a little bit to how Python is used differently from how most of the other data infrastructure tooling is used. I think a lot of us come from more of a business intelligence standpoint: we think you've got a table of customers, you then want to run some SQL queries on it, do some lightweight machine learning. The Python users and organizations are often far more creative. They're doing weird experiments; they're trying to bring in new and different kinds of data sources; they're bringing in some audio files; they're doing some other weird stuff that really breaks the mold. So about half of Dask usage was enabling those other experiments. A good example of this is Prefect, a workflow management company. That's completely different from big pandas, and they're built on top of Dask, or rather they can use Dask under the hood to scale out. The Prefect authors saw Dask and said, "Oh great, that's everything I need to sprinkle parallelism into my code base," this very different kind of code base. So really Dask is this magic parallelism dust that other library authors put into their code. But for common use cases, I
would say it's about a third big pandas; about a third big NumPy, which is imaging, MRI, climate science, big industry; and about a third wacky stuff. That's common in hedge funds: they've got really advanced trading strategies that are being run on Dask, because things are changing really quickly and they need a lot of very custom stuff. Credit risk models inside of banks are really common. XGBoost deploys on Dask really comfortably; that's a very common combination. But yeah, a bunch of stuff. I'm happy to get into particular workloads, but the general answer is that Dask is a very general-purpose parallel computing platform which has been so easy to integrate that now there are a lot of common use cases built around it.

Got it. And so what's the typical entry point into Dask for a production workload or a team? Is it because their pandas data frame processing just isn't scaling the way they want it to? I assume that's a common on-ramp onto Dask. Are there others?

Yeah. Maybe you're on your laptop and you've been working with CSV files that are hundreds of megabytes large. Then you're given a 20-gigabyte CSV file, and your laptop doesn't have that much RAM. Instead of importing pandas you import Dask, and you just use Dask on your laptop to process that much larger file. That's probably the most common individual use case. Dask comes installed by default with Anaconda, so most people have it on their laptop today if they have a data science stack, and it's just the easiest thing to get started with. And they maybe will stop there. You can actually do quite a bit of computation on a MacBook Pro; at the 20, 50, 100 gigabyte level you're probably comfortable just on your laptop with Dask. Then you could say, "Well, actually there's this five-terabyte data set sitting on my
company's cloud. How do I get Dask deployed on my company's infrastructure to use it at scale?" That's a whole different paradigm shift, and people love it when that happens. It gives you all this power with the interface you're already used to. It's very fun seeing people's eyes light up when suddenly they've got not four little cores working on a laptop but 400 working on the cluster, and it all feels the same. It really connects the remote infrastructure and makes it feel very close at hand. It's a really magical experience.

And just to be clear, it's the identical code base that scales to support all of these additional cores?

More or less. There are slight differences. My favorite example is that if you want to compute the median of a data set, it's very hard to do that in parallel, but Dask will provide alternatives like approximate quantiles. So there are some things that are different. As for the APIs, the pandas maintainers also maintain Dask. When pandas makes a release, Dask makes a release; same with scikit-learn, same with Jupyter. The core maintainers of Dask are just the PyData core maintainers. There's a lot of effort to make that feel very smooth.

Great. So then let's talk a layer lower: what does infrastructure deployment of Dask look like? If I'm a data scientist and all of a sudden I'm hitting the limits of my MacBook Pro and I want to deploy to a cluster (did you use the term cluster?)

Yeah.

So what has to happen at the infrastructure level for that to successfully happen? Is that something I can do as a data scientist, or do I need my IT team to get involved and install things?

Honestly, I think that's sort of a softball for me. So Dask, the open source project, deploys on any major resource manager that's
available today. You have a bunch of physical hardware somewhere, on the cloud or in an on-prem rack, and that physical hardware probably runs one of about five or six different resource managers. Kubernetes is an example of this; it's quite common today. YARN is another one; that's the older Cloudera, Hadoop, Spark kind of system. If you go to HPC systems you get names like SGE, Slurm, Condor, LSF on IBM systems, et cetera, and Dask deploys on all of those. So if you have access to those things and you understand enough about them, you can deploy Dask on whatever hardware you have. Now, what's common is that you're a data scientist in a company that has cloud access with Kubernetes, and you have no idea how that works. I'm sure you, Pete, could easily deploy Dask, and that's what we often see: early adopters in companies use Dask, and then getting the rest of the company involved is a little bit trickier. We can go into all the challenges that people face there.

Got it. So the Dask bits get installed on a broad set of systems and then they start to listen for workloads. Is that the general idea?

Usually what happens is that Dask clusters are ephemeral. You say, "Hey look, I want to look at this data set." You ask Kubernetes, "Hey, give me a Dask scheduler and 50 workers." Kubernetes gives those to you 10 seconds later. You can say, "Great, I want to go do some query," and then those workers do some work: they fire off, they read data, they do some pandas stuff, they do some PyTorch things, they then give you a nice plot, and you show that in your Jupyter notebook or your batch process or whatever it is that you do. Then maybe you keep those workers around for a long time if you're going to keep doing that work. If you've got a really advanced system, your system will then tell Kubernetes to scale that down so you're not taking up
money, and when you ask another question it scales back up again. The question of whether they stick around for a long time or are adaptive is sort of an IT question.

Yep, got it. So then what are some of the main challenges that companies have in migrating to Dask? Obviously you've seen many different types and sizes of companies and teams work to get onto this platform over the years. What are the main pitfalls, gotchas, caveats, tips and tricks and hacks that you can mention that are insightful and honest?

I'm trying to think about what the dishonest fun ones would be.

You're nothing but honest, so I don't know why I said that; just trying to spice it up a little bit.

I'll try to break the mold here a little bit. There are maybe two different kinds of users we talk to; I'm now switching a little bit into for-profit mode. There's the small-team case. Say you, Pete, are a data science lead at a company. You serve two to ten people on your team, you probably use Dask, you did the whole Kubernetes thing we just talked about, and you're happy with it. Fantastic. There are then a bunch of extra challenges. Your teammates don't know how to use the cloud; they don't know how to turn on the Kubernetes thing on their cloud accounts. Maybe you end up managing that for them, and you turn into about a half- to quarter-FTE devops person. You probably leave the cluster on for a bit too long, so IT is concerned that this stuff is running all the time; you've actually forgotten a few clusters that have accidentally been running for the last few months. You're also running arbitrary Python code at scale across your corporate infrastructure without any kind of security, and that might create some concerns as well. You've got a team to manage,
and some of your teammates are more diligent than others about turning things off, so you want to make some policies and some quotas, some team management. So there's a set of things you need to think about, and it ends up taking an IT team six to 12 months to build some semblance of this. Those are the common gotchas. I've done this I don't know how many times at different companies, so we ended up making a separate company, Coiled, which just has that service with all those things built in, and it's super easy to use. That's one set of concerns. At the IT or top-down level there's a different set of concerns, things like: how do I use these GPUs? How do I accelerate this credit risk model at scale? Those sorts of business questions. How do I make sure that Dask doesn't break on me? How do I get enterprise support? So there's a different level of questions that happen at that larger scale.

And so you saw enough of these questions from enough users, you saw the Dask footprint and implementation numbers growing. Were you getting bugged by people in the open source community for more support than you could give, and hence it's time, this company needs to happen? Was it that simple, or is there something more there?

Yeah, no, it's that times 10.
Honestly, I should have done this a few years ago. [Laughter] For reference, Dask usage is quite high today. The Python Software Foundation did a survey of all Python users, and when asked which big data framework they use, I think something like 12 percent say they use Spark and about five percent say they use Dask. So among Python users we're below Spark but above a tool like Hive, for example. Turn that number around: any company with around 20 Python users probably has someone in the company using Dask today. So the demand is very high. It was very high a couple of years ago. I was actually going to make a company a couple of years ago when NVIDIA stepped in and asked me to instead lead a team. They were starting RAPIDS at the time, and they wanted me to lead their multi-GPU data science story. So I did that for a year and then started Coiled afterwards. But yeah, demand is quite high. There's certainly a business here. I should have done this a couple of years ago.

Well, that's great. This is an amazing indicator and proof point for other founders who might be building popular open source stuff. When you're able to release open source in the wild and see significant adoption, and the questions and the support requests come back to you, that's one indicator that the time might be right to think about commercializing something around it. Was it difficult for you to build a commercial entity around something that you had built in the open, in the wild, and had donated to the community through your open source efforts? How did you think about that?

Certainly making a company is hard for anybody. I've never met anybody who said, "Oh, this is easy." My life is much less fun now than it was managing engineering teams. In terms of bringing in a for-profit angle, that hasn't been much of a
problem. I've always worked at for-profit companies while working on Dask. Anaconda generously donated my time; NVIDIA, in that relationship, donated my time. Dask is maybe distinct in the open source Python space in that we have always been relatively well funded by for-profit companies, and yet we've also managed to maintain this multi-institution, community-oriented transparency that I think you need in order to be this kind of infrastructural project. I think Kubernetes did this well too. Kubernetes is maybe a Google project, but it's really managed by a bunch of different organizations. TensorFlow is maybe the counterexample; TensorFlow is very much dominated by one company. We've taken steps to make sure that Dask does not feel like that. So Coiled feels like my next thing, but not necessarily fully encompassing of Dask itself. Anaconda, Quansight, NVIDIA, Capital One: these are all companies that provide a lot of support around Dask.

That's great. I think it's good for us as a community to have discussions around how the open source world and commercialization options are evolving, because, as Wes McKinney has noted, his concern is whether this is really sustainable over time if we don't figure out some longer-tail version of how open source developers get paid. If we're just looking to the big companies to subsidize your time to contribute to Dask or other big projects, is that the only model, or is there something slightly more robust and community-driven that we need to be thinking about and talking about as well? This is obviously a bigger can of worms, but you might have some quick thoughts there.

I've got lots of not-quick thoughts there as well. Maybe I'll share some thoughts that I came to while at NVIDIA. NVIDIA is a weird case, because NVIDIA is definitely a for-profit company. There's no
conclusion that they're not there just to make money they sell hardware yeah right but they sell hardware it's this yeah yeah they actually don't want software lock-in they want hardware lock-in and so they're actually very much incentivized to be a good player when it comes to software kind of one layer separated i think i gave a talk about this at scipy 2019 if people want to look at that talk and i think that open source right now is in a good position because we have enough mindshare that we can define the rules of the game where those rules are protocols and conventions and interfaces that anybody who wants to have access to our users has to abide by so for example a lot of libraries today look a lot like numpy or they look a lot like pandas or look a lot like scikit-learn even if they're doing their own very proprietary stuff and i think that while we still have that power that control we need to really harden and make sure those interfaces and those protocols uh are solid and that that defines the game that then all the companies can come into and play on so nvidia for example we worked very hard so that numpy and a gpu equivalent to numpy could both work in a lot of different software libraries that required a lot of sort of community effort but now as a result you know if google comes in with tpus or other hardware they can play in that same level playing field and so i think that's our goal to kind of solidify those contracts the software contracts so that everyone from now on is kind of forced into that convention this may be an overly technical description but i think there's like a nice interesting point right now to control the game yeah no i love it i think i think that's a great point um so just to sort of go back to to the company um i'm wondering about your some of your thoughts on on go to market right um open source projects have a distinct advantage in some ways in getting this grassroots bottoms up community traction um but but as you know putting 
your your ceo and founder hat on how are you thinking about enhancing that go to market um on behalf of of coiled um and and dask sort of do you have a quick mental model as to how you're thinking about sort of pouring gasoline on this on this python slash dask adoption in the enterprise or are you just going to let it ride from an open source standpoint yeah no um yeah i actually come from more of a consulting background like i've had to fight for dollars in the past people directly this whole vc thing is new for us um uh so yeah we're not just doing just like straight up open source until we clearly dominate the space i think we've kind of clearly dominated the space right now right now the goal is to capture that grassroots adoption and see what we can do to grow it and see what we can do to monetize it certainly bottoms up makes a lot of sense for us and we actually started with a top-down enterprise sale mostly because i happen to know a bunch of finance companies using that um we were urged early on to go bottoms up and that actually was i think really good advice uh we still have like 95 percent of our money comes from a few large enterprises mostly in the financial services but you know like 99 percent of our users are small teams um and understand of our information coming in is small teams so we're doing a bit of both um i would say for the next year we're probably very much focused on bottoms up and that's that's that sort of team lead role that i talked about before and that that conversation is just really repeatable uh the sales cycle for that is is weeks not years um and that's like definitely been uh a lot more informative than the sort of long selling to it selling to top down uh sales cycles which we've done but it's like it takes nine months and we're 12 months old so it's we had to do something those intervening nine months engineering-wise um but it is a challenge like i do we have enough large finance companies and enough like government 
organizations for example all of who want to use us and all of them are very strictly on-prem that it you know there's you know many millions of dollars in arr we could capture if we wanted to um we don't have to we've got venture capital money to like do this bottom-up thing for a while it's very tempting to go back over lessons from databricks or snowflake say go bottoms up but it's it's not entirely clear that that's the right case for us it's an interesting thing i think about today um i think it's a very open very interesting question well i'm uh really curious to see how this evolves for you i mean obviously you're a community oriented kind of a guy but it's nice that at least you know that there's big company interest and there are revenues to capture there um depending on sort of how the company might want to pivot its go to market if if that becomes necessary in the future yeah i think again bottoms up makes sense we're kind of building all those leads in the future but it's nice knowing that we are able to close large top-down deals because it makes it very clear we're on the right path and so the plan is you know make a few large enterprises happy while building out a large set of customers and then you know a year two or three in the future you know turn those smaller customers into larger customers correct um besides the go to market questions like what's been the most challenging thing about starting a company as an engineer i think there's lots of people in our community who would would love to know like what what's waiting around the corner for them i would say there's not a most challenging thing there's a there's a every couple of weeks there's a new most challenging thing like the job that never stops changing um uh we've actually started hiring a bunch right now uh and actually having more seasoned people in the company realizing i don't have to do everything on my own has been a huge weight off my shoulders that first year was really really hard um 
maybe that's maybe a good um that's a good lead into a solid piece of advice i think as as maybe like a a good engineer or a good manager i'm very used to being able to say like look if things are going slowly i can just like put my head down and solve this problem quickly and i think now that i'm no longer good at my job right like being a founder or ceo like it's just a constant flow of not being good at your job i think um learning to delegate more is probably something that people with my background probably do need to figure out sooner rather than later um yeah i'm learning that you need to like rely on other smart people i think it's also something that um i've always the people i've hired have always been like slightly they've been like extensions of me i've like given them engineering time now i'm hiring people who actually are more experienced than me and are taking on functions much better than i can and i think getting that shift right at the center of the organization rather than the top of the organization is a shift that needs to happen especially again for engineers because we tend to be maybe high performing individuals in the thing yeah absolutely i think an engineering founder has to learn how to hire two different groups of people there's people who are fundamentally different than you first of all just in personality or approach or style and then there's um you have to build the sales org and the marketing org and the you know the the recruiting org and there's all these different types of roles to hire for as well so um it's it's challenging as a founder to figure out one's path through all of these minefields but it helps to have great investors advisors mentors friends other founders to talk to um so you know thank god we have a community of people who are willing to support us and in starting sort of bringing these visions to life in a way that actually can scale through a team when some of us haven't had huge experience in building huge teams 
before we started our first companies so yeah definitely just to echo that a little bit there's a huge amount of empathy coming from other people who've done this before which is yeah that's awesome well i'm glad you're able to step into that it's a it's a i think it's a tremendous time to be a founder and to be starting a company because there's so many different kinds of help um and networks of support that you can get right now so it's a golden golden era to start a company for certain certainly yeah but i'll also say i mean from a from an engineering perspective like there are things that we've achieved through a company for the open source community that we never could have done on our own put in my little call to action here for a second so you asked before hey if i'm at a company and i want to figure out how to scale dask how do i how do i do that right now if you're a python user you can install coiled it's a python library you can pip or conda install coiled and you can import coiled into anywhere you're using python and you can then like get a fully secure fully managed fully authenticated fully controlled running on your cloud of choice inside of your own vpc in your own account in about two minutes right and so like that huge sort of devops problem that a lot of these early users were facing is solved um if you're willing to pay coiled some fraction of what you would pay normally um and so i think that like again is a thing that we really couldn't have solved as an open source community building service like that i think does require a company behind it to keep something up running all the time to hire different kinds of individuals to work on and that capability is genuinely novel that's a capability that has focused if you go to like the dask conference from last year a third of the conference was about how to solve this problem in different ways and it's now like i think coiled has made it trivial and that i think is is a big boost to productivity both 
for us as a company but also to you know climate science or or or medicine or there's a lot of use cases that we're serving in our broader mission that i think are are very well served by the fact that we have built this thing as a company it's uh yeah that's that's really well said and um i think that goes back to the previous question about um you know you not having cognitive dissonance over both contributing to open source and building a commercially viable company around it because there are truly powerful things that happen on both ends of that decision that um that that couldn't exist otherwise yeah the open source community is very good at some things it's very bad at others and you need a healthy mix of non-profit for-profit institutions all mixed together and building that culture is interesting right and that's i think we've done a very good job of that in the python space probably i think anaconda for example did that quite well um but that's that's certainly a challenge and there's lots of interesting back and forth yeah for sure it's fun i hope that continues maybe going back to wes's earlier point but i think right now we're in a good spot there's a lot of money floating around there's a lot of respect to open source contributors and community it's a good time i think companies like ours are are furthering that same good time but there's a fight to be made there right and we have to make sure that you know other shark-like companies don't show up and right and just take and just and just monetize and that's really important that i and other folks like wes have well yeah that's that's really well said and um i've enjoyed the conversation that it's been um great to chat from chat with you um before i let you go i'm curious if there's something that surprised you most about working in the data ecosystem over the years um i would say personally it's been it's been amazing fun i think there's there's very few disciplines where you get to touch so many 
different kinds of problems fairly deeply i feel like researchers everywhere just welcome you into their hardest problems asking for advice and that's a really unique opportunity i'd recommend this field for anybody who's sort of naturally curious about the world i would say maybe a flip to that it's also very surprising how frequently their problems are all the same problems um and so i think we're also like well positioned playing that role to unite a lot of these different groups and bring them together so if you go to again a dask conference maybe another call to action you go to summit.dask.org we're doing a user conference in may so people who may be interested in subscribing to that great and you see astronomers and you see geneticists and you see hedge fund managers and you see banks all talking about file formats uh and it's actually very fun to have those all those very smart people all in the same room um it's great like it's a it's a very interesting convergence of different intellectual disciplines that's happening right now right around technology and that's summit.dask.org yes great we'll make sure and uh put that in the chat window so folks can find that link easily well well thanks matt it's uh been great to chat with you i really appreciate you being here with us today yeah this is a blast thanks pete so just to let our community know uh there's a couple of quick things i wanted to run run by with you first of all thanks again to ibm uh who sponsored this episode has been a big supporter of dc thursday don't forget to hit subscribe and hit the bell icon so that you get notifications when we go live again next time you can leave us comments about today's episode via our feedback form if you are willing to sound off and let us know what we could do better we greatly appreciate that and then finally our next episode will be on march 18th and we'll have will falcon who is very involved also in the the python community this time on the pytorch side and will's 
the founder of a company called grid ai so he'll be joining us for our next dc thursday on march 18th

Info

Channel: Data Council

Views: 412

Rating: 4.7777777 out of 5

Keywords: data engineering, data pipelines, data catalogs

Id: tEbJk8i1DRw

Length: 48min 44sec (2924 seconds)

Published: Thu Mar 04 2021