DC_THURS on Great Expectations w/ James Campbell

Captions
HOST: Welcome, everyone, to the next episode of DC_THURS. We have a really awesome episode for you today that I've been looking forward to for a while. Many of you in the community, maybe most of you, have heard of the Great Expectations data testing framework, and our guest today is one of the co-authors of the project. I'll introduce him in a second, but Great Expectations is an amazing open source tool that you can use to test the data flowing through your systems. The project debuted at Data Council, I think in early 2019, and we've been tracking it for a while, so we've seen a lot of its growth. We've had the co-founder and co-author Abe Gong on the show before, but our guest today is James Campbell, another one of the brains behind the project. James and I are looking forward to having a chat: not just an update on Great Expectations the project, though there are some cool new features we're going to talk about, including profilers and extensibility, but also about James's background as an analyst in some interesting sectors. So James, I'm really excited to welcome you to the show.

JAMES: Thank you so much, I'm really excited to be here.

HOST: It's great to have you here, and you have a really interesting background, so I want to get right into it. You were an analyst in several different types of roles, and I know that includes even some work for the NSA, which you may or may not be able to share many details about. I'm curious to hear a little bit about the early part of your career and how you got into data in the first place.

JAMES: Absolutely. I was reflecting, as I was thinking about this conversation, on what the trajectory for me has been, and there's a really interesting dynamic where on the one hand I've gotten to see a lot of change and, like you mentioned, move between different roles, and I can mention a bit about that in a second, but it
has also been a very direct trajectory in some ways. When I go back to thinking about my academic background: I actually recently looked at my college application, and they asked what you would want to study. At the time everyone is canonically uncertain, but I marked down math and philosophy, because that intersection was the interesting area for me. That has continued to be true: it's what I ended up studying as an undergrad, and throughout my career I've gotten to move between more quantitative and qualitative analytic roles. Right out of college I worked as an economic analyst, doing mostly securities litigation expert witness testimony, and then I moved into the intelligence community, like you mentioned. I was really lucky to have fantastic mentors when I was very young and to get to work in networking, so I moved into computer security. As that became something with a tremendous amount of resources behind it, I got a chance to grow, I think, with the zeitgeist into the big data movement, and to experience the perils but also the joys of working on those kinds of problems.

HOST: And now of course you're the CTO of Superconductive, CTO and co-founder, which is the company behind Great Expectations. How did you discover that you wanted to start a company as an engineer?

JAMES: It was interesting for me to think about the prospect of leaving government. I have always been really drawn to public service, so I loved that aspect of the work. But again, to that trajectory comment, I remember being very interested from a young age in the challenges of entrepreneurship. In many ways the problems are very different, because we're more resource-constrained in some ways, but the opportunities are different too: we have a lot more
ability to just be out in the community. Being part of an open source project is not something I even realized, long ago, would be a possible way of starting a company, and it's been phenomenal to see how much community engagement, interest, and contribution there is.

HOST: How did you get interested in open source projects, or get involved in that space in the first place? Is there a story there?

JAMES: Interestingly, software engineering was always more of a means to an end for me. I was mostly learning and doing coursework focused on being able to solve particular problems, whether that was cybersecurity questions, taking advantage of novel data sets, or the economic analysis I mentioned earlier. So the benefit of open source was apparent to me throughout my career. I remember working with Hadoop when it was literally some old computers under a desk, trying to figure out if we could process more log data, and the idea that we could take advantage of so much really impressive thought and work that a huge community of people had contributed was always really exciting. In my last role in the government I was with a lab called the NC State Laboratory for Analytic Sciences, which is sponsored by the intelligence community and whose explicit goal was to reach out and engage more broadly with the community, to make sure the best minds in the world were working on the kinds of problems we were interested in. So again, I got to see what incredible things can happen when you harness the power of an incredible community. I never dreamed, when I started working on Great Expectations, that it would become the project it has. I was looking just the other day: there are well over a
hundred different people who have contributed substantively to Great Expectations, representing companies big and small, students and professionals. And of course we know that for every contributor there are many, many more users.

HOST: I find the origin story of the company, and of the project, equally interesting. Great Expectations is the open source project and Superconductive is the company behind it, but you and Abe started off doing consulting, correct? And Great Expectations, the open source tool, essentially fell out of that work as you were looking for repeatable ways to do things with clients. Can you talk a little bit about that, and more about the origin story of the Great Expectations project specifically?

JAMES: Sure. It's actually even more serendipitous than that. Abe and I have known each other since we were very young; our parents had studied together in the Bay Area in the '70s. We were in different areas: he was working in the healthcare space, an entrepreneur through and through who had already worked on a number of different companies. As he got really interested in making changes to the healthcare space, he saw the same problems I had experienced in the intelligence community. Just by virtue of having been friends for a long time, we had a routine call, the two of us and several other people we knew, where we talked through things we were interested in. I said, you know, I'm going down to this lab where I can pursue my passion in many ways, and this is one of the things I want to work on. And he said, no kidding, I've been working on the same thing. So we began collaborating, for me purely on the open source part, not engaged at all in the consulting contracts, while he was doing that. And then it's only been
since I joined the company, about two years ago, and actually only about a year ago did we decide to pivot fully into Great Expectations instead of the consulting side.

HOST: That's a great founding story. It's always illuminating to understand the details and circumstances surrounding how two founders met, because there are lots of engineers in our community who are looking to start companies, and that's one of the things that's exciting to me about getting folks together at Data Council. I don't just want them to find career opportunities or jobs, or get smarter, or discover open source; I like them to start companies, when they're so inclined. Being able to match founders together is one of the exciting things that comes along with that.

JAMES: You certainly have a gift for bringing people together.

HOST: Oh, thanks. I'm excited for mass vaccinations to roll out so we can run in-person events again, because it's great to see people on the Zoom screen, but it's not exactly the same, as we all know. Well, that's a great background on the company and the project. The other thing you mentioned that's intriguing to me, that I have to ask about, is the intersection of math and philosophy. Before we move on from your background, is there anything you want to say about encouraging people to get a philosophy background if they actually want to go into data?

JAMES: It's funny. Long ago there was a huge pipeline out of philosophy departments into software engineering, in the '60s and '70s, and I think there's always been a really deep link between computing and philosophy, whether that's a lot of 20th-century analytic work, or work on the fundamentals of computation.
And what's been particularly interesting to me is the space of really understanding confidence. I think we're seeing a huge resurgence that, to your point, really calls for more investment in philosophy, and that's in explainable AI and the ethics of AI. As models become more complex, but actually, just as importantly, as the relationship between the construction of training sets and the influence of the model on the population becomes more pronounced, in other words, where you can no longer assume independence between the predictions generated by gen one of a model and the training data used for building gen two, we need to think very carefully about what we want the system to do. In many ways that's the philosophical side of Great Expectations. When I've talked about what we're trying to do, sometimes I use the phrase "analytic integrity," meaning really understanding, with a lot of precision, how confident we can be in the judgments of any given system, which will depend on the characteristics of the system but also of the environment it's located and operating in. I could go on forever about that, but I love the way you frame it; I think there is tremendous value in being able to draw on the work that has come before us in thinking about those kinds of problems.

HOST: Perhaps it's valid to say that philosophers like to think in terms of frameworks; they're establishing a framework for a way to see the world. And it's probably not too far off base to think of GE as establishing such a framework, and a language, and abstractions, and all of these notions, in the world of data. Of course, some people might just say that data and philosophy are both black arts, but I think I can
understand maybe some of the esoteric connection there. I think a lot of folks in the community are already familiar with GE, but before we get into the updates and the really cool features I know we want to spend a bunch of time on, can you give us a quick background on the project: what it does, its main components, and how folks are using it today?

JAMES: Absolutely. Great Expectations is, at its core, about helping people express expectations about data and verify that those expectations hold on data sets. Data quality is really what we're about, but quality doesn't mean the data is pristine in any particular way; quality is always tied to the purpose for which the data is being used, so it's about being able to draw that line. One of the things we're really committed to, like you mentioned earlier, is Great Expectations being a language, a way of communicating. It provides a mechanism for users to declare expectations, which can be things like: I expect a column to exist in a table; I expect values in a column not to be null; I expect values to come from a particular set that we've agreed on with the product engineering team. But they can also relate to aggregate characteristics: I expect the mean of the values in a column to be in some range, or I expect the distribution of values in a batch of data to fall in some range. And because we're fundamentally in the business of declaring and then testing expectations about data, we have a really natural way of presenting that, which we call Data Docs. One of the key features Great Expectations offers is the ability to visualize your expectations in a prescriptive way, "the values in this column should not be null," a normative, prescriptive statement, but then also in a descriptive way, you know,
like "there are 15 values in this column," and then finally putting those together in a diagnostic way, so you can get a pretty web page that says "there should be no null values, but there were 15 null values: warning," and you can tie that into an alerting system that runs right in your infrastructure. We have a lot of users who, for example, deploy Great Expectations in their Airflow pipeline, or whatever DAG runner they use, Prefect, Dagster, even a cron job, and they get this diagnostic report about whether the data they're processing meets expectations. That's really what GE does: it helps you declare, and think carefully about, what your expectations for data are, and then validate that they're true.

HOST: Got it. And how many folks do you have using GE at this point? How do you think about the size of the community? I know it's a growing open source project, but are there specific things you get excited about in terms of KPIs or metrics, and the impact you're having on the community?

JAMES: Great question, and that goes back to our comment earlier about just being excited to be part of an open source community. It's very difficult, as an open source library, to say with confidence how many users there are, because there are lots of users we'll probably never know about. But we're really focused on building community, so a couple of the KPIs I'd flag as indicative of that: we have over a thousand users in our Slack community now, and we see job postings routinely in there; people can advertise their problems, and of course there's support for GE in the community as well.

HOST: It looks like you have 2,200 people in your Slack community. Over two thousand as of right now, because I just signed in to check it out,
and I see over twenty-two hundred people. So just a heads up: the numbers are even rosier than you thought.

JAMES: I'm underselling it. Well, the other way we measure community, and I mentioned this one earlier, is the number of actual contributors to the code base, which isn't a hundred anymore; it's over 120. In fact, let me try to redeem myself on the numbers: I think it was 126 when I looked last week, and it's gone up a little since then. Those are the measures I get most excited about. Of course there are other ways too; we do have some instrumentation in the library that helps us see how many validations are being run.

HOST: Got it. Well, this is a good place to mention to folks who are listening that we do want to take questions for James in the chat, so if you have questions about his background or experience, or about the GE project itself, feel free to pop them into the YouTube chat and we'll do our best to get to them as the show unfolds. So James, I think we understand what GE does and some of the origin story. Maybe going back to the higher-level concepts surrounding design principles: I'm curious how you thought through what you wanted to accomplish in the GE project, because I've always been impressed by the thought that's gone into the structure of that framework and of those abstractions. You're very much abstraction designers, and in conversations with you, with Abe, and even with the rest of the team, some of whom met at Data Council incidentally, which is another cool story, the team has always been very thoughtful about the way they've rolled things out. As an example, I'm curious if you can explain some of the design principles behind the project, how you prioritized trade-offs, and what things you cared about. If you could walk us through some of the main thinking behind the
project design, that would be super interesting, I think, for our audience.

JAMES: Thank you. It's really been fun to have a community we can discuss these things with, and one of the things that's helped with that approach to design is that we haven't been exclusively focused in one domain. One design principle is that we want to become a language; we want to be declarative and expressive. That means we often use very verbose names. One of the expectations I remember, from when we were first thinking through how this should work, described, literally in excruciating detail, in the name of the expectation, what statistical test we were going to run, and then the parameters become the parameters to the test. So again, that kind of transparency and verbosity are things that are really important for us and for our users. And as we've been able to do more work on renderers, and we can talk about that a little more later, more work around Data Docs, we're really focused on making sure we never lose hold of exactly what's happening, on exposing that to users, and on maintaining our credibility first when we're doing any kind of new expectation. Another thing we're focused on is making sure that the prescriptive or descriptive presentation of an expectation, or any other way it's presented, is understandable to as many users as possible, ideally at different levels. We want to make it possible to look into the result of a validation and get a lot of detail, while also being able to get a summary if that's what's relevant for you, making it easy to structure the information right away. That's on the expectation side. Interestingly, I think some of the most challenging design decisions we've faced have been around how we interact with others' infrastructure. You know,
there's a really interesting tension where Great Expectations is definitely doing compute, right? We're computing the mean of a column; we're computing whether values match a regex, using some other engine like pandas, or your database or warehouse, or Spark. But we also don't want to become a tool that is used for transformation. We need to have our tests live really close to the raw data; that's how the value comes. What we really offer people, I think, is the ability to understand what the data is that they are getting, as well as what the data is that they are emitting. In fact, one of the really subtle design questions is what exactly you test, and we often find that testing your inputs is actually more important than testing your outputs. I don't know if "important" is the right word, but it often offers more insight to the data team, at least in terms of getting started; eventually, I think, teams usually mature into doing both.

HOST: Is that to say that you're trying to stay distant from transformation processes, and you're very conscious of trying not to blur that line, or not to encourage your users, as the case may be, to blur that line in their own systems?

JAMES: Yes, that's exactly right. Now, we are introducing some of those capabilities, but we're doing it in a very measured way. Going back to the point about philosophy: when you think about the expressivity of a computing system, there are several pretty clear inflection points. At what point do you allow somebody to defer computation, or use a variable instead of a parameter that's known in advance? At what point do you fill those variables in? At what point do you provide for conditions in the flow of computing? At what point do you allow grouping of data? And so we've been
introducing those concepts in a very careful way, so that we never lose track of being about data quality; that is a core promise GE offers. And just to hit on the last thing I was saying with respect to integrating with DAG runners: we absolutely want people to be able to test data in whatever transformed state it's in, so we've done a lot of work on making sure we can also be really clear about what has been tested, hence the concepts of a data source, a data connector, a batch: what exactly is running. Often that confuses people who hear "batch" and ask, "Can we use GE on streaming data?" The answer is yes, for exactly the reason we were just describing: we provide multiple lenses into any given set of data, so that you can evaluate it for the different purposes, and at the different times and places, that are most relevant.

HOST: Got it. And along those lines, I'm sure there are different philosophies of so-called anomaly detection, if you think about the broadest category that GE might play in. How do you compare and contrast the way GE approaches this field versus other projects or other models you've seen?

JAMES: That's another one I really love as a question, because it lets me be philosophical, and a lot of people ask it. The biggest difference, the way I like to think about it, is that with GE there's a huge amount of intentionality around exposing to the user clarity about what they are ultimately testing, and I mentioned that question about at what point parameters get filled in. So first: you can absolutely express basically any kind of anomaly detection using the language of Great Expectations. But what we're really excited about is the way we bring clarity to what it
is that you're testing. The other thing that's really different, and we can dive into this a bit later because we're doing some new work in that area, is something I sometimes describe as "out of band": we bring out-of-band information, meaning we test data using knowledge and insights that you, as a data team, have derived from outside the data set. The knowledge of what the acceptable values for some code field are doesn't, in general, come from inspecting the data set; it comes from understanding the system that's populating that field. So we allow you to bring that conversation with your colleague into your understanding of the data, and into the testing for that data set, so that you're not relying exclusively on anomalies in the sense of deviations from what has been observed in the past, but on anomalies in a fuller sense of the word: deviations from what could properly be expected based on the behavior of the system.

HOST: And how exactly does that work? Can you give us a specific example of how one might translate a conversation with a different business unit into GE code? What does that look like? What does that really mean?

JAMES: Absolutely, and in fact I'll pick an example that I think really highlights a lot of the value of GE, because we can play around with at what point that conversation happens. I mentioned that I did a lot of work on banking data, and in banking data it's common to have a given transaction broken up into different codes. So maybe there's the total value of a transfer between two accounts, but also the cost basis of that transfer, what portion of the transfer is taxes, whether there was withholding. So there may be a number
of different rows with particular codes. Now, suppose the company offers a new product and chooses to encode it with a new code in this data set. There may be a huge number of dashboards throughout a large organization that rely on that data. For example, if you have financial data where we might assume that one code is the total and the other codes represent a breakdown of that total into subcomponents, then if we add a new categorization or a new code, or tax law changes, or there's new regulation, any dashboard that relied on summing up the values of some set of codes would break. And it would be subtle; it wouldn't be obvious, but in some subset of the transactions you would lose fidelity. The way that conversation might play out in a lot of organizations is that the data team would basically send out a big email: "We're changing this system; starting on March 3rd we're going to add a new code; update accordingly," and hope you get the note. With Great Expectations, you could have put, and still can put, a test around it: "I expect the values for this code to be exclusively drawn from the set that I'm summing over." That way, as soon as a new value appears, you'd immediately get an alert about it. And you're not learning that from the data, because your other systems aren't relying just on the data, right? You've got a dashboard or an analytic system that knows what those codes mean; they're actually semantically meaningful to it, in a way that another anomaly detection system may not be able to take advantage of.

HOST: Correct me if I'm wrong, but this would be a way of detecting when a value in that new code emerges, not so much schema validation of whether or not the new code is
sort of required. Is that an appropriate distinction?

JAMES: Yes, that's definitely true. Now, just as you can express anomaly detection, you can express schema validation with GE, but yes, in that example you're absolutely right: it's really about understanding changes in a complex data system.

HOST: Got it. But everybody's system is complex, whether they know it or not, right?

JAMES: Isn't that true.

HOST: But you indicated that you can also describe schemas, and then do schema validation, in GE. How does that work, briefly?

JAMES: Just like the values-in-a-set type of expectation I mentioned, there are expectations about the storage type of values in a column, the names of columns, the order of columns, and uniqueness: we could expect column values to be unique, for example, to express a uniqueness constraint in a schema. So it's really about which expectations you encode. One thing that's really neat about that, and this is something one of my colleagues has been working on just this week, is that you can translate expectation suites between different environments. Normally you might express a uniqueness constraint in SQL as a uniqueness constraint, but that's not something a CSV parser exposes. So you want to be able to verify that a new piece of data matches your schema even before you've attempted to load it, and get a very detailed report about what's going on. Great Expectations is great for that.

HOST: That makes perfect sense. Well, I want to talk about some of the new cool stuff in GE as well, because you've been really busy at work, and I've seen some great updates fly by. Talk to us about what's happened in the last few months, and what the main categories of work have been.

JAMES: You're right, this is really
exciting. There are several areas where we've done a lot lately. One I'll talk about is the modular nature of expectations, which is really about making it easier for users to contribute custom expectations and manage custom expectations. In fact, I'm going to be doing a deep dive into the real guts of modular expectations later today, which will become available as a webinar. With modular expectations, what we're doing is taking all of the logic about an expectation, meaning what metrics it relies on, how those metrics are evaluated and compared, what it needs to return, how it should be rendered in those descriptive, prescriptive, and diagnostic ways, and potentially in other languages as well, and encapsulating it into a modular unit: a Python class that can be imported dynamically at runtime, registered, and made available during validation. What that means is that it's much easier for people to encapsulate all of the logic of their custom expectations, which people have been writing in Great Expectations from the very beginning, into something portable and shareable. As a result, we've just launched a new package called great-expectations-experimental, which is built out of a contrib module in the Great Expectations repository, and where we really want to open up to the entire community the ability to add whatever expectations they want, and make that easy. One of my colleagues, Eugene Mandel, has been doing a lot of work building up templates, guides, and documentation to make it super easy to author expectations. There are also some really fun performance improvements we get as a result of this modular shift, because it allows us to store metrics and more easily express relationships between batches, and so forth.

HOST: And do you
and do you have a mechanism for the community to contribute their expectations back to some kind of gallery i'm not sure if i've heard anything specifically about that so i don't want to ask well i don't want to you know you just about wrote the launch announcement there i don't think we um well you're definitely in the right area we have not launched the gallery yet um but yes exactly one of the things that we've done as well as all the things i mentioned in these expectations is we've provided some mechanisms for you to add metadata and some test examples for example that are rendered into a really beautiful accessible page so that you can search and find others' expectations create them and see them yourself like i mentioned this repository is now published directly on pypi already it's available for people to contribute to with a really low bar we literally have templates for expectations so lots more in that space what i was mentioning is kind of the guts that made that possible but you're exactly right that's where we're going very exciting well we'll be waiting with bated breath for some official announcements in that capacity when the time comes hopefully soon yes well uh the expectations extensibility sounds really exciting um have you seen some good uptick in interest from the community in that so far i'm sure you get you know early users in the slack channel and you get early feedback from folks um so i assume the response so far has been positive yes um we've still marked that api as experimental um and so we haven't been encouraging people with production deployments to move over yet but what we are doing is we're actually sponsoring several hackathons over the next few weeks where people who want to contribute expectations can have live support from somebody on our team and pair together
with others so it's really you know bring your idea bring what you want to be able to express in the language of great expectations and we'll help you make that possible that's great well count on us to help you spread the word to our community when that time comes so if folks watch the data council newsletter or the great expectations slack group i'm sure they'll be hearing about those hackathon announcements soon thank you absolutely um so what else i know there's more than just the extensibility i've been hearing stuff about profilers plural which is interesting to me and i kind of wanted to ask you about that and i've been hearing things about these new batch style data connectors you mentioned them briefly just a few minutes ago um talk to us a little bit about these aspects and what that means absolutely so for batches first i think you know i mentioned this idea that we're making it possible to have a variety of lenses defined into a data set with the new style data connector and batches great expectations will really help you manage the promises around relationships between batches more easily so one of the things that people really like about traditional anomaly detection systems is in some ways it can be a very useful black box you know put it on a data set ask it to watch a number of metrics and say did these metrics change in a surprising way it's a really useful feature so in order to make that easier for people to do in great expectations what we've done with batches is provide a really strong sense of ordering and partitioning of batches of data so that with a data connector configured you can define different ways to slice a data set whether that's a load time or whether that's a date key or whether that's some other key you know that's relevant for dividing up a data set
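the ordering-and-partitioning idea can be sketched in plain python; it also shows why a stable ordering makes a month-over-month comparison well defined, since each batch can refer to its predecessor by index. the function names, the date key, and the row-count metric are illustrative assumptions, not ge's data connector api:

```python
# Sketch of the ordering-and-partitioning idea behind the new-style data
# connectors: slice a dataset into monthly batches by a date key so that
# batches have a stable order, which makes "compare this batch to the
# previous one" (e.g. month over month) well defined. Names, keys, and
# the row-count metric are illustrative, not GE's connector API.

from collections import defaultdict

def partition_by_month(rows, date_key="created_at"):
    batches = defaultdict(list)
    for row in rows:
        batches[row[date_key][:7]].append(row)  # "YYYY-MM" partition key
    # sorted partition keys give every batch an index, so a batch can
    # refer to its predecessor as index - 1
    return sorted(batches.items())

def month_over_month_counts(partitions, max_drop_pct=10.0):
    counts = [(key, len(rows)) for key, rows in partitions]
    results = []
    for (_, prev), (key, cur) in zip(counts, counts[1:]):
        drop_pct = 100.0 * (prev - cur) / prev if prev else 0.0
        results.append({"partition": key,
                        "observed_drop_pct": round(drop_pct, 1),
                        "success": drop_pct <= max_drop_pct})
    return results

rows = [{"created_at": f"2021-01-{d:02d}"} for d in range(1, 21)] \
     + [{"created_at": f"2021-02-{d:02d}"} for d in range(1, 11)]
partitions = partition_by_month(rows)
print([(key, len(batch)) for key, batch in partitions])
# → [('2021-01', 20), ('2021-02', 10)]
print(month_over_month_counts(partitions))
# → [{'partition': '2021-02', 'observed_drop_pct': 50.0, 'success': False}]
```

the slicing key could just as easily be load time or any other column that is relevant for dividing up the data set.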
and one of the things that does to get to your other question is that it unlocks much richer profilers um so you know you said why plural well because there's not just one way to learn from a data set and so what we're doing with profilers is making profilers composed out of rules that users can author and then it's possible to just combine those rules in different ways and so a rule might be something from your organization like i know that we have a convention where any column that ends in underscore dt should be a date time or it could be that any column that has the term id in it needs to be unique or you know whatever the conventions or things that you have in your organization that kind of out-of-band knowledge but they can also be things that you learn like we've done some interesting work around semantic typing meaning you know estimating what kind of data is in a particular column based on its cardinality if every value is unique it's much more likely to be an id whereas if there's only a small number of values maybe this is a code linked to something else so with profilers what we're doing is making it possible again to decompose what your knowledge is into rules and then we're providing users several profilers that also are able to look across multiple batches so that you can generate expected metrics based on much more information than you could previously in great expectations very interesting um i want to go to a couple questions because we're getting some questions from the community and i think they may have to do with some of this functionality um so brian is asking i've been meaning to ask your team about a feature i've been looking forward to the feature is to make it easier to do comparisons between two batches of data for example month over month is this the sort of behavior that you're referring to
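the rule-based profiler james describes, out-of-band naming conventions like underscore dt plus learned signals like cardinality, can be sketched as composable functions; the rule names and the string output format here are illustrative assumptions, not ge's actual profiler api:

```python
# Sketch of a composable rule-based profiler: each rule looks at a column
# name or its observed values and proposes candidate expectations, and a
# profiler is just a combination of rules. Rule names and the proposed
# expectation strings are illustrative, not GE's actual profiler API.

def dt_suffix_rule(column, values):
    # organizational convention: columns ending in _dt should be datetimes
    if column.endswith("_dt"):
        return [f"expect_column_values_to_be_dateutil_parseable({column})"]
    return []

def id_uniqueness_rule(column, values):
    # learned signal: an "id" column whose observed values are all unique
    # is probably an identifier, so propose a uniqueness expectation
    if "id" in column and len(set(values)) == len(values):
        return [f"expect_column_values_to_be_unique({column})"]
    return []

def profile(table, rules):
    suite = []
    for column, values in table.items():
        for rule in rules:
            suite.extend(rule(column, values))
    return suite

table = {"user_id": ["u1", "u2", "u3"],
         "signup_dt": ["2021-01-01", "2021-01-02", "2021-01-02"]}
print(profile(table, [dt_suffix_rule, id_uniqueness_rule]))
```

swapping the rule list in and out is the composability point: two teams can profile the same table with different conventions and get different candidate suites.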
yes absolutely so for a long time we've had what are called evaluation parameters which allow you to defer the identification of a particular expected value till a later time and so in the past that's how users have handled this month-over-month problem they validate one month's data and then they use an evaluation parameter that describes the observed value from the previous month um it's pretty clunky to be honest i've never loved it it works but it's not really what we were wanting to do and the reason was it wasn't really clear how to describe that concept it's very natural and intuitive for us what does month over month mean but for great expectations we needed those promises of the data connector that provides an ordering of batches so that we can say okay now i have a data connector that produces monthly batches and you know we use the term partition like i mentioned and partitions have an index so you can just say give me the partition at the previous index uh now truth in advertising all of the machinery is in place we have not kind of packaged that up as a feature we're planning to bring that out together with profilers so that that kind of functionality is completely baked in but you're absolutely right that's exactly where we're going and i'm glad that jumped out at you got it i think it's related to another question from the chat i imagine ge is very helpful when ingesting third-party data as it could flag deviations in input data over time how have people used ge to perform holistic integrity checks on one source and track how the source changes over time it seems like this is a similar type of use case as far as i can tell is that right well yes although i think often this is even a more traditional ge use case this is very much the one we were describing where there are two teams involved um my favorite example we had one team that talked to us about what they use ge for which is kind of writing that awkward email to a vendor that says well you
said you were gonna give me this but i looked at it and it's not like that and you know it's a tough email to write because you need to be precise well ge makes that nice so if you're buying data from an external vendor you can write expectations about how it should behave and how it behaved in the past and then if that vendor introduces a new code or you know any number of things happen the total number of people in the world goes down instead of up or something that just shouldn't ever happen uh you can look at that and you can have some real clarity around what's happened in order to communicate back to the vendor got it yeah that makes sense especially with these third-party data sets and folks integrating them from multiple sources um so one more andrew is asking in the financial services example you gave earlier would it make sense for the downstream systems to publish their expectations to the producing system yes yes indeed we actually have a term we use for that we call that a data contract um and i think that's really a phenomenal case in fact with data docs it becomes possible to publish the prescriptive version of an expectation suite you know in a readable form put that up on a website for example make that available to them but yes also to actually share the expectation suite itself i think that's a really good use case got it awesome great questions and great to hear from folks in the community um so i just wanted to go back to any other clarification you wanted to make or to dig deeper into profilers because i know i kind of interrupted you during that section um is there anything else in the profilers that's interesting for users to know are people building additional tooling around the profilers are they automatically kicked off or is this something that's kind of you know do it once and do it
out of band how are people using the profilers so far i'm just curious for more color there yeah i think that's actually an area where we're likely to see some continued growth from the community and i don't think there has really emerged yet a standard practice for how that's done in fact abe and i had a conversation just a couple of weeks ago in which we were sharing back and forth some candidate rules that we would have written and you know i looked at his and i was like i don't think i would do that i think that should be an expectation suite not a profiler and he looked at mine and said i don't know i mean i think that should be more automated and so we're still seeing it evolve i think just like with modular expectations one of the things that we're really focused on is making sure that it's easy for people to express what they want with great expectations and that we can position the library to become a tool that's ubiquitous that people use to really describe what they expect so the real key thing that i'd emphasize on profilers is that it's about composability it's about the ability to express what you want and make that easier to experiment with and then of course also for us to help people learn their data better um so i wish i could say that there is one profiler to rule them all um but i think it's also lovely that that's not the case that there's so much opportunity for people to still compose and build yeah that's great well i'm really excited to see how the community takes to these new features and i'm sure you have even more goodness coming up this year so it'll be an exciting year to track the ge community and i can't wait to see how these things continue to emerge thank you thank you very much it's again been really fun and i think you know there are some things
that i want to kind of hint at for a second about where i think we're also going and um i think it was andrew's question about these data contracts one of the things that we really want to do is make it easier for people to have those collaborative conversations and engagements across teams and organizations so we've actually started working on building out some saas features around ge that will support those kinds of engagements which is really hard to do when you're working in the space of configurations and teams and code in your engineering pipeline so i think that's what i'd encourage people to look out for and we'd love to engage with people on that to identify how we can best support those kinds of interactions and engagements yeah the collaboration across teams is so important and it's gratifying to see how much that has come up in our conversation because establishing a language and a process and sort of the edges where teams are able to collaborate with each other is super important and we've seen that in the data council community over the years you know we started off just as a data engineering community because that was sort of underserved but then there are all these edges between data engineering and data science and the analysts and then the business users who are sort of consuming the analysis and so i like how you guys are sort of going for this collaborative model and trying to establish a baseline and common ground so that teams can start to work with each other on some of these thorny data problems so yeah absolutely thank you well i just wanted to ask you before we let you go if you have any advice for others who are thinking about starting a data oriented company is open source the way to go um i mean obviously that's the road that you guys chose with ge and that's maybe circumstantial or maybe that was the universe um you
know sort of putting you on a new path but do you have any advice for other engineers or analysts out there data scientists in our community who might be thinking about starting a data oriented company i think going way back to where we started this conversation for me i've always been passionate about improving our ability to understand something and having a core problem in mind so i think the things that i find most exciting the kind of tooling that i have found most exciting is where you're addressing a need that's tied intimately to a business problem or an analytic problem and one of the big changes that i see happening that's driving this next generation of change in the data ecosystem is the much tighter integration of data production and consumption with business units there's a much lower barrier now for every analyst to be able to look at and create data sets which has created these incredible benefits and therefore the need for communication that we're addressing with great expectations but there are a raft of other related problems and i think staying tightly tied to whatever the human aspect of the data system is is the thing at least for me that i find most fascinating and i think it's just so important not to lose sight of yeah that's well said um and what has surprised you most about your work in the data ecosystem so far maybe apart from sort of the importance of the human element is there anything else that really stands out to you that you didn't expect i think the level of community engagement i couldn't have anticipated um you know i was thinking oh well we'll offer up great expectations and it'll be done you know data scientists will put it in the front of their notebooks and data engineers will put it into their pipelines and sort of the world will be good i never could have realized that you know
there are just so many smart people out there who are willing to participate in building something so impressive you know i think for example we've gotten multiple integrations with other dag runners uh you know with alerting systems and tools uh people have written integrations with different storage backends for storing and accessing expectations um and you know the world is full of incredible people so i just can't help but get excited about that and it's really fun to see that in practice yeah that's great it's great to see the community you know bond together around some of these projects and really appreciate them and offer feedback and support and guidance and usage it's tremendous we live in a really amazing time um in the open source world where we're able to build these projects in full view and with digital community tools like slack and others to get information distributed immediately it's quite rewarding to build a company sort of out in the open and i think you guys are a great example of how to do that best thank you very much yeah thanks for joining us james um this was a really fun conversation and it was really awesome to be able to chat with you for an hour absolutely i've loved it as well and thanks for putting this together and again i think bringing so many people together is really fantastic yeah it's our pleasure and to the community out there um thanks for joining us this episode we hope that you enjoyed it you can sound off in the comments there's a comments link in the youtube chat and if you let us know how we're doing we really appreciate it so that we can continue to make dc thursdays more beneficial and valuable for the community also if you're an engineer or founder who's looking to get your own company off the ground i've been offering office hours the last few months and it's been really rewarding for me and hopefully for the community as well so
we'll put a link in the chat if you'd like to avail yourself of one of the office hour sessions that i have coming up um i love to talk to technical founders about starting their own companies and how we can best support them using the resources and the amazing experience that we have here in the data council network um finally don't forget to subscribe so that you get notifications for future episodes and we're looking forward to our next event on february 4th where we're having barr moses from monte carlo who'll be our guest so make sure you tune in then and i'm sure barr will have some interesting perspectives on data monitoring and data quality as well since it's such a hot topic in the space right now so thanks again to james and all of you and we'll see you next time
Info
Channel: Data Council
Views: 497
Rating: 5 out of 5
Keywords: data engineering, data pipelines, data catalogs
Id: SuXl6UY6EaM
Length: 56min 20sec (3380 seconds)
Published: Thu Jan 21 2021