Elasticsearch and Eland: AMA with Seth Larson

Captions
[Music] hey everyone welcome back to our tuesday live stream from the elastic uh live virtual user group uh i am your host as usual jay miller developer advocate at elastic and this week is going to be a party because for the next hour i have probably the person outside of my team that i talked to more than anybody uh at elastic the one seth larson seth what's up man doing good oh i i haven't messaged you too much in the last week just to get ready for this uh we're gonna be talking about one of my favorite clients uh in the elastic sphere uh but before we do that welcome to everybody that's in the chat be sure to give us a good howdy-do and uh let everybody know where you're coming from and uh yeah let's jump in first seth by telling everybody kind of how you got to elastic oh yeah well so getting to elastic involved uh being a maintainer of a lot of open source projects most notably uh urllib3 which is an http client library it's the most downloaded package on the python package index and i've been the lead maintainer for i think two years now maybe a little bit more and then so urllib3 obviously it's used everywhere and i had some experience with api generation with uh the virtualbox api bindings for python and i think those two just combined to make me a good candidate for doing api design for an http api and so yeah started at elastic it's almost two years ago now i think and uh yeah time has really flown to be honest but it's been a really great time so it has absolutely flown by in fact as we speak today marks my one year at elastic so like i i remember having like the conversations with you when i first started i'm just like oh so you know you do urllib3 no big deal it's only the most popular and most widely used you know module in the python package index no big deal though right yeah yep uh it's it's a fun one to maintain having the immediate impact like being able to make some changes in that uh module and just being able 
to impact like millions of people with one deploy it it's pretty it's fun and it's frightening and it's lots of different emotions come into all parts of it so it's it's really good so i i love the tweet that you sent out that said like oh when another major framework in python gets an update you know that urllib3 is probably running something behind the scenes which means urllib3 also gets a giant download package that day yep yep it's funny the whole world runs on seth i guess we'll have to thank goodness it doesn't thank goodness it does not i only have so many hours in the day awesome well speaking of we're going to talk about another uh package that you run because you're also responsible for the python package which i did a live stream on a while back it probably you know it did like half the justice that you could have done for it but i think we covered the basics you know send requests use bulk life's easy right yep no the other modules i maintain at elastic are the elasticsearch python client the enterprise search python client which is like app search workplace search and then i also maintain eland which is kind of maybe part of the focus of today but it's more of a data science window into the elastic stack yeah so so let's let's dive into this so i remember you mentioning eland being very similar to pandas and for for those that aren't in the python world pandas is a data framing tool and kind of the pillar around data science in python or at least one of the pillars yeah i definitely know pandas is an amazing library the the data science community wouldn't be where it is for python if it wasn't for like numpy and pandas and matplotlib and all of these amazing libraries one of the most quoted things that people love about the python ecosystem is its libraries so so i guess the the first hard question and by the way speaking of questions if you have questions for seth myself um about eland or any of the other tools that we have if it's another tool 
then we have the perfect place for that it's actually our discuss forum but uh if it's related to what we're talking about today feel free to ask it this is an ama we're going to be bringing up questions as they come in but you know since eland is is supposed to be this this pandas but for elasticsearch why not just make it a plug-in like why not just make it an extension of what something is doing why is this a dedicated client and and it does dedicated things right yeah so i think one of the big ones is the dependencies so because eland pulls in a whole bunch of different dependencies this is kind of like why isn't eland just a part of elasticsearch dsl or um the base client uh so it the big one is dependencies because it uses pandas and numpy and optionally a whole bunch of like machine learning libraries that was really one of the big motivations for keeping it separate another thing is that it actually originated from a completely different team it originated from the machine learning team within elastic and then kind of got adopted by the language clients as like okay we'll kind of help this go forward promote it make sure that it gets all the love that it needs uh to go into like general availability and so that's kind of the story of why why two modules why two packages nice i i think we learned a few weeks ago that like one of the biggest users of the elasticsearch clients is elastic and like the elastic teams like kibana is a big example of like how we can just beat the the javascript client you know to bring it down to its knees and just find more and more use cases for it uh so it is cool to see like we're using these tools and then we're like hey everyone else in the world could probably use this too why don't we why don't we just kind of give this over to the languages team and then let them kind of continue to update it and maintain it and we get a benefit from it but also the rest of the elastic community gets to use it as well yeah no 
there's a lot of internal python usage at elastic there's so cloud uses a lot of python for a lot of their like aggregation job like background scripts and then there's also the performance team which uses the client for a tool called rally which is like a macro-benchmarking framework that we run every night and we look for performance regressions in all of these different elasticsearch features and i know that the the like analytics so like cloud analytics they use python extensively so yeah we have tons of internal users for every single language client which is really cool because you get uh people before you put something out new you can just tap all these shoulders and be like hey can you can you make sure that we're doing the right thing here and it's really great having internal users well i just added a note to myself to make sure we try to find someone on the rally team to to come onto the stream and show us how they're using it because i'm to me my biggest weakness in eland which i'm hoping we'll get to later today is that machine learning bit of it i've i work with a lot of data but it's always like a lot of a little data and it's never related so you know it's always kind of hard to be like i'm gonna learn pytorch today and just like or i'm gonna learn this you know machine learning library and uh yeah so let's let's jump and talk about the the usage of eland at the basic level what are some of the things that you can do rather easily with eland that makes data consumption easier yeah so i think one of the biggest features of like eland there's two big things that eland does right now today it does like data frame support which is essentially you have data in elasticsearch and you want to manipulate it look at it just use it through this lens of pandas or data science libraries and then the other one is machine learning and so machine learning there's all of these apis that elasticsearch has that are machine learning related and so how can we interface 
with those how can we bring in third-party models into elasticsearch to operationalize them so those are like the two pillars i would say that a lot of people it's pretty split like what people end up using eland for it's one or the other i actually haven't met a ton of people that use both besides like internal teams um uh so i would say that like the basic like basic stuff you can do is if you have data in elasticsearch you can create a data frame that points at that data and then just explore it in like a jupyter notebook or you know in a repl that's one of the big things that i kind of like to do with eland it feels kind of magical almost because you know that the data is like in the cluster and it's not getting pulled locally but it feels like it's local um so that is like a really magical experience and then the machine learning aspect like i you said that you don't know a ton about machine learning i also don't know a ton about like the super details about machine learning i know the basics and so what i know is that eland is really great at you have a third party model that you want to like deploy to something that is actively looking at your data as it's coming in and making decisions about that data and then like recording those in those documents and so we have this concept called like an ingest pipeline and so you can basically put this structure in front of an index or a data stream or whatever and then it'll modify your data as it's coming in and so machine learning can be a part of that pipeline and you can have like a hot swap model so you have a model living in elasticsearch and then whenever you want to update it you just say okay here's the new version of the model and then the processor will start using that new version and so eland is kind of like the glue between you have a third party model that you trained yourself to put it in the pipeline and have it start impacting my data and i mean there's a bunch of like little apis that you kind of get 
like oh i want to delete this model or i want to you know see which models i have but the big one like the super magical one is the i have a model and i want to deploy it to elasticsearch or there's also this little api that lets you take local data and then test it against an elasticsearch model that's already deployed that's also something that like people have done to like oh i want to test my model that's in elasticsearch to make sure that it matches the one that i have locally so that makes sense and and i think one of the things that and maybe we can even see a demo of this is when you talk about like io or like ingestion of data it's that power it's that pandas-like power that i love with eland because eland just makes working with data frames easy like if i have a data frame and you know i need to get a bunch of work done on it the fact that i can say hey let's let pandas do all of the you know applying and and merging and mutating of work and then let eland work with that resulting data frame instead of having to then go out and export it and oh i need to convert this into json and then just load it in from you know elasticsearch-py it's it's cool to just be like all right eland you you've kind of ridden with me this whole route pandas has done a bunch of work eland take take this data frame and just make it make it work like make make that happen and then on top of that you you've taken a lot of concern or a lot of consideration into how do you blend those pandas-isms and the elasticsearch-isms because both of them are very ismistic i guess i don't know that's that's a new word i've coined um but they each have their own like ideas of how they want to work and you've found a good way to blend the two so that it doesn't feel like you're learning two different things it feels like you're doing both at the same time and it just kind of makes sense and rolls so if you have like a demo of that that'd be perfect but um otherwise we'll have to work with the 
word soup that i just made yeah i mean i have a little script in front of me that i use to upload some data some public data i can i can share that if you'd like and just kind of we can just show it off uh let me pull that up there we go while we're doing that good uh hi to everyone in the chat as we're getting set up here but it looks like we're here can you make that a little bit bigger i sure can how's that is that big enough or is that too big now no that's great all right you'll have to tell me when i am live on that side you're you're up and running you're you're all set so it looks like up and i think hold on one second seth you're cutting in and out here let me let me do some stream maintenance of my own all right i think you're good now oh okay try it now okay sorry y'all yeah too much too many bytes going over the internet um so right now i kind of have a client instance right now that's configured with elastic cloud so i have like my cloud id and i've authenticated and then i have this data here which this is just like a big csv file that i downloaded from the nyc it's like restaurant health standard data so i just downloaded that this is that raw csv with modifications and i'm kind of just into pandas the way that [Music] push it into format that elasticsearch is gonna track rename more sense like dba doesn't mean anything to me but business means a lot more okay seth um i think we're still running into some some connectivity issues uh if someone in the chat can let me know are y'all having connectivity issues on your end as well it might honestly be mine too you never know yeah i i'll say i i'm in san diego where it never rains and we had a thunderstorm where the power went out you know for like four hours this morning so it could very well be my end as well but uh while we're waiting for someone in the chat to let us know if we're having connection issues i kind of just want to go back through what you were what you were mentioning there um if you can 
scroll up a little bit so you took one you took some data it was in a csv file i i have learned that tends to be how a lot of public data comes in i actually have a i have a conference talk coming up in like a month about how to work with public data and csvs are very emphasized in that um but then what you you did there is you did a a read_csv um why not just throw it directly into into eland and say all right eland do your thing like why did you bring pandas in first yeah so throwing something into eland is essentially putting it directly into elasticsearch so a lot of times what you'll have is you'll have columns that maybe you don't need columns that like they wouldn't make sense for you so things like long id fields that just don't mean anything to you so it's not data that you want the other thing is like if your data is not in the format you want so like a lot of times dates for example if you live in the united states your dates will be month day year and that's like a horrible format and so you want it more to be in like a native date format that elasticsearch is able to act on like immediately so for example year month day so that's something that i always do like rearrange the dates so that they're formatted nicely um geo points same thing like elasticsearch wants your latitude and longitude to be together it doesn't want like one field that's latitude one field that's longitude because then you would have two numeric values that like can't be used for like distance queries or any sort of geospatial query so you definitely want that geo_point type that's kind of like the the bulk of what this is yeah and to kind of add on to that i mean when we mentioned earlier that like pandas has a lot of these things in place i mean one the parse_dates flag just in general to me has been a lifesaver because you know without having to do mapping templates or you know doing kind of this hard defining of this field's going to be a date time you can just say hey pandas when 
you read the csv file parse the dates and it will automatically not only recognize that there's a date there but it'll set the type to date time so that when you do go to upload it into elasticsearch eland will pick up on that and make that conversion for you and i've definitely run into that location um question of like this needs to be one one location i like how you did that though with the zip normally i have like a a separate function and then i just do an apply which i've heard you probably shouldn't do a lot of applies on your data so i'm i'm actually trying to figure out better ways to do that i mean if anything this is almost the same as an apply because it's still pulling it into python so anytime that you have a huge data set you want to stay in like pandas land and so doing an apply is essentially like applying python code to something that otherwise wants to work in c so anytime you can use native pandas functions to do your operations that's best but in this case i don't think there's really anything you can do to like do an f string to concatenate values and all that so yeah let's let's scroll down a little bit and see see where we're going from here yeah so this is just the dates and actually i i had no idea about the whole parse_dates that that is actually wonderful i'm going to be using that in the future yeah i think that it's it's one of those things where like as long as it works like it works i know sometimes it'll it tends to work with like iso formatted dates like if someone throws in next tuesday it's probably going to be like nah that's a string but you know i would be really impressed if it could figure out the context of next tuesday especially based on the current date like if it's tuesday tomorrow me saying next tuesday doesn't mean tomorrow oh man someone write a library for that uh that's where you start looking at dateutil parsers like and then again you'd have to throw an apply in there but uh let's let's sit let's sit on this 
spot here because i think this was the thing that i really had to learn so eland uses keywords as the default for text right yep so text fields default as keyword and so keyword if you don't know anything about the distinction between text and keyword in elasticsearch keyword fields are really really fast at equality like you can do equals equals and that will work extremely fast but then it won't give you the things that you kind of like know elasticsearch for which is like full text search so that would be like a text field so a lot of times what you'll see is people will have like action or a business name will be text but then they'll also do a subfield that's a keyword just to get the best of both worlds um but in this case like we're just kind of assigning types to each one so and that and that eland decision to go with keywords being the default is that just based on elastic's behavior with the rest of its apis or is that kind of hey pandas we had we had a long conversation a while back about the idea of query versus es_query and how pandas is really looking for boolean match values so was the decision to go keyword over text based more on pandas or based on like elasticsearch behavior i think it's based more on elasticsearch behavior uh so keywords uh if you could imagine they're a lot smaller to actually store and i would say a lot of the string values that you get in like data sets are not ones that you would want to be doing full text search on like for the most part they're going to be values that you are going to want to do either equality or they're like an enumeration value where it's like for in this in this case it's like grades so you have like a b c d um and so i think that's one of the big motivators for that is just like size and also if you were to put text into elasticsearch it would be keyword with if you just didn't tell it anything um another thing that is actually happening in that area is so i had like a suggestion that like 
okay if you if there's fields in a string-like column that are huge like should we just use text because like they're they look like prose basically um so that that's something that i'm thinking about and it is something that i think that elasticsearch does if you go through a certain api but don't quote me on that so yeah just by the way for those who don't know this is the community channel where we say things and then usually the answer is always it depends or let me get back everything we're saying here is often just based on our experiences not uh not anything that someone has told us to say uh but yeah i think that um this the es_type_overrides when you do pandas_to_eland to me was it's something that i constantly forget because again pandas does such a good job already of just doing type inference but as someone who does tend to work with like web apps and trying to do you know i think my most used query type is simple query string so like i'm that works with text keywords are just not friendly for that i mean they work but you're gonna run into some confusion there so i think using these type overrides to say yes i know normally this is a keyword but at the end of the day like i kind of need it to be text it's not impossible to do and it's actually really easy to do compared to you know another option i guess would be creating what creating your index using elasticsearch-py setting your mappings when you create your index and then trying to append all of the data from pandas into it at that point which uh i've only discovered works sometimes um there's a lot of room for error there yeah definitely and it's kind of the it's the tough position between what is too magical right like python users i would say in general like if if you look across different languages python is one of those languages where people do expect a good amount of magic for better or for worse i mean obviously explicit is better than implicit but in my experience a good amount of magic is 
kind of expected by users to have like that good pythonic experience and so it's it's tough to strike that balance of what is too magical like how am i making too many decisions for the user um yeah so it's a it's a tough place it's a really tough place to be where you're like okay i really want this to work in 99 percent of the cases but then i also need to make it configurable for the one percent of cases and i also need to somehow document it so that users know like okay you're in that one percent case of you need that to be set um definitely the whole keyword versus text is an area i would say that that's probably one of the biggest ones that is like okay you you need to know a little bit about elasticsearch at this point in your juncture of uh using eland like you can't just know about pandas and then have all of the features of eland work perfectly if you don't know how to do like what types elasticsearch has and then what the differences between those are so yeah there's still a little bit of like you need to know elasticsearch to do eland like to its fullest potential but i would say that like if you just have a data frame and you just like throw it into elasticsearch like you'll still be able to be kind of dangerous with that and the nice thing is it lets you like iterate and you know if you want to just upload your data like a small subset of your data just try it out and then okay that looks good now let's upload all of it like that's also a totally valid use case so yeah and this is why i love doing these live streams in the middle of the day around everyone's lunch break we actually have people from the machine learning team that get to come in and provide some input so ben tells us the default to keyword is we had to choose a default so one you had to pick something pandas filtering on text is usually equality which you know super friendly for keywords and data sets for supervised models not natural language processing is usually keyword-based so like everything 
else is pointing towards keyword elasticsearch uses keyword by default it's kind of a win-win to go that route seems obvious but i will also mention he said the idea of switching to text based on length and number of spaces is a nice idea to which i will respond with remember all of our clients are uh open source and people are invited to contribute go to github.com/elastic/ and from there you can make your opinions known but let's uh yeah i would say that our like community contributions for eland have been like wonderful uh i have a lot of fun every time i show up and there's like a new pull request or something so yes please if you're watching and you want to try getting into open source i don't bite i promise i'll be nice to you okay i don't know i've made contributions to your your clients before don't give it away don't give away the ending no it's definitely a good time and everyone's super helpful with that um the last part of this so we're using pandas_to_eland which is doing a very simple thing it is taking a pandas data frame and it's turning it into a body of content some docs that we're going to upload into an elasticsearch index and you have this es_if_exists replace what's the default behavior of like if i forget that line because i feel like i've done that before and i didn't get the results i was thinking of yeah so the default is it'll just fail so it'll just complain um so this es_if_exists basically is saying okay if there's already an index in elasticsearch that you are trying to insert data into like what do we do so the default is to fail it just raises an error that says hey there's already an index here you've got to do something um either delete it or set es_if_exists to a different value so if you in this case replace it does exactly what it says on the tin it'll delete the index and all the data that's in it and then replace it with whatever data you just passed to pandas_to_eland and then there's another value which it's not really 
a great default because it's you never know like when you get passed a new data frame that has like some slight changes you can't uh like merge the mappings essentially like if there's a mapping difference you're gonna have a bad time um but the other one is append and so like if you have data that you've added to elasticsearch and now i want to add some more you would do es_if_exists append and it would append the data to it and it would also check to see that the mapping matches uh reasonably well i don't know if it's an exact match because there needs to be some sort of like there's some massaging that happens on the elasticsearch side for example if you say that the mapping is like a date time or an image or whatever but your data is something else it will try to massage it into it unless you tell elasticsearch explicitly to not do that but it tries to be helpful so i like that and and yes that is i think that with this what we're realizing is usually this this is getting the data in this isn't this isn't working with it so you know you're not to my knowledge you're not making a bunch of mutations um by bulk indexing things over and over and over again i mean if you're going to make a change you're going to make that change but it's not going to be like oh i took 5 million documents mutated them all and then re-indexed them on top of the existing architecture with the new version update like i i think that the smarter role there is to just re-index have it be you know version one again and just call it a day um and and that said one of the things i do like about eland which we i want us to kind of start segueing into is when you're working with the data you aren't limited to just one index you know just like with elasticsearch we support index patterns so you can say i'm going to grab you know data once a week and it's only going to be that week's worth of information and just index after index after index after index and then allow elasticsearch to 
say all right i'm gonna do a search on these you know 52 indexes that we've pulled in this year and let it do the work for you you don't have to worry about like oh i need to constantly overwrite my existing data or append to my existing index on you know one after another every single week yeah definitely and so there's this concept in elasticsearch called a data stream which is basically exactly that it's how can we package tons of indices that have like some sort of time or like freshness component to them uh pack them all together but then make them as easy to use as just a single index um and so there's some more testing i need to do for eland specifically with data streams but in theory they work right away out of the box and you just point at it as if it was an index um and so that's like a really nice thing that you can do especially if you have data that's like continuously growing but then your care about the data scales with how old it is or how fresh it is data streams are a really great uh like solution for that problem and you'll actually find that quite a lot in elasticsearch so like things like logs and metrics those are big use cases for elasticsearch so like data streams really big part of that and so in theory you can use eland with that just the same way you'd use it with just a normal index with static data in it so yeah so let's let's take that as a way to kind of segue into like okay we've got our data into elasticsearch by eland how are we now working with that data still using eland sure yeah so i have if i scroll all the way up to the top is this text size or that might be too big is that text size fine uh maybe a little bit smaller a little bit smaller okay how's that okay that's good okay so this is just a jupyter notebook that i have running and it is connected to that exact same cluster that i just uploaded that data into so this is kind of just showing off that hey yeah the client is actually talking to an 
elasticsearch cluster uh and then this is kind of just like a how that uh mapping that index mapping looks so you can see all of these like the text that we overrode for business all of these other like types that it just decided so as you can see like jay said the keyword is a it's a popular one um but those actually all make sense to have this keyword and then the settings that's just normal elasticsearch stuff so now let's create our first eland data frame so that was pretty uneventful all it does is you give it a client and you give it an index pattern so in this case this is just a single index that we're working with nyc restaurant one uh you can give it like a pattern of indices so like wildcards you can give it a list of indices to work with it'll work with them as if it's one index it does check the mappings to make sure that they're all compatible with each other otherwise you're going to run into problems where your data is just like all sorts of different shapes and then so if i just like run this data frame to just kind of like show off what the repr of uh of this data frame is you'll notice that this is like this looks very similar to what you would find if you were just like a data frame from pandas and it fully native my scrolling is not working fully native so you can like see all of the different columns which is kind of nice integrates with jupyter and then so if you go to the bottom of that it's the same thing like you can run info on it and you can see like okay here's all of the pandas things that have been figured out from the eland to pandas side so like this business field that was text in elasticsearch is object because pandas doesn't have i think they have some sort of native string type now i've seen some rumblings about that on twitter but i haven't used it yet so i don't know anything about it and then you know standard stuff like date times integers and floating values those are all there and you can see this is the size of like 
all of the data on Elasticsearch: it's almost 100 megabytes. And then in terms of the memory usage locally, all of the storage here, it's only 64 bytes, so very small. The data is all living in Elasticsearch. We're not pulling anything locally just by doing these queries; you only get the results. Do you know, if you just did that read_csv from pandas, how large the memory usage would be there? I mean, would it be the 99 megabytes that exist? Let's see. If I were to just pull that data locally... let's try that. So this method here... Now you're showing off. So this essentially just asks Eland, "Hey, that data that I'm looking at right now through the window into Elasticsearch, can you just hand that to me as a pandas DataFrame?" So this might take a while, I don't know, we'll find out. It'll take a little bit because it's got to download 100 meg of data at least. So that 100 meg is not an exact total size, so I'm not sure what this is going to be. It's just what the shards are reporting as how big it is, so it might take into account replicas as well. That's an Elasticsearch concept that allows for high availability: essentially, if one node were to go down, the cluster would immediately be able to serve from a replica as opposed to just having an outage. So yeah, this might take a second. Well, while we're waiting for that, let's just keep going. One of the things I wanted to highlight there is something that I often emphasize, and I think a lot of data scientists are looking for these solutions, especially when they're starting out, even when you look at modules like Dask and those kinds of things: what happens when I have too much data? What happens when I'm trying to load some information in and it's three gigabytes, or, as we were talking about with George Kobar on ILM, possibly terabytes of data? If we
need to work with this data, I don't want to have to load all that into memory. I don't have a terabyte of memory to load. And in this case we're looking at 35 megabytes compared to, what was it in Eland? It was 99. So it's probably counting the two replicas, is my guess. Okay, so I think that the good starting point is: I have a bunch of data. This is just memory; it's not memory plus storage. So unless you've just got terabytes sitting on your computer at home that you can lend and have queried by pandas all the time, in which case more power to you. But to me, I love letting a server handle it, where I have ILM policies, where I have all of these things to make sure that I'm keeping the data that I need to keep and managing my storage efficiently, and then on top of that I'm still able to interact with that data as if it were a pandas DataFrame. Yeah, one of the cool things about Elasticsearch is the horizontal scalability and the ability to just say, "Okay, I have another cheap machine with hard drives plugged into it; my cluster just got bigger," and things keep working pretty well. You can even tier individual machines: "I want this to be my hot data, the data that just came in," so maybe you have that hooked up to SSDs or something. And then you have cold data, or warm, or whatever you want to call it, where you have it hooked up to spinning disks, if those are even relevant anymore. And then there's super, super cold, which we call frozen, and searchable snapshots. Those are other Elasticsearch concepts where you actually have the data living on S3 or Google Cloud Storage, and there's a cache, but it's still searchable. So if you were to try to access that data it would take a little bit longer, but not that much longer; it would take minutes as opposed to milliseconds. And so that data
is almost infinite storage, right, with S3 and Google Cloud Storage. So the idea of operationalizing all of these different classes of data access and storage, and how much availability you have there: Elasticsearch is really good at that, and that's something that is really cool. So if you have that tons-and-tons-of-data use case, Elasticsearch is a pretty good fit for it, and then Eland is just icing on that cake, basically, where you can keep that same API that you really like with pandas and all these other data science libraries and just use it natively as if the data was on your local machine. Yeah, there's a lot there. It's quite fun. A lot of people seem to like the magic. And to double down on what Ben said, I'll bring it back up: you can do to_pandas after you do your aggregating, after you do your querying. So again, if you're working with a terabyte and a half of data and you only need that small segment, you don't have to pull everything down and then start running pandas methods on it. You can say, "Okay, Elasticsearch, give me exactly what I need. Okay, now that I've got it, let's just make this one a pandas DataFrame." And honestly, I don't know if I've ever used to_pandas. Usually I can just work with it being in a DataFrame that's supplied by Eland. I know that there are a couple of things that pandas can do that Eland doesn't do yet, but I think that the need for to_pandas is becoming less and less every single week. Every time there's an update it's like, "Oh, we added all of these modules to make it more complete with what pandas is doing." Right, yeah. So Eland has this whole idea of how it is going to transform the data that's in Elasticsearch into a DataFrame, and so any time that you move away from that idea, for example, the example that I always use is transpose, rotating your data on a diagonal, that is not exactly
something that happens in Elasticsearch; that's definitely a post-processing action. And so we treat a lot of the post-processing stuff as not applicable to Eland, and instead you just have to opt in to "I want to pull my data locally now." So instead of being too magical, where you accidentally trip over a landmine that pulls all your data and does all these operations and becomes this super complicated thing, you have to opt into it and say, "Okay, now I need this exact pandas API to do what I need, but it's not implementable in Elasticsearch." And I'm not going to even expect people to have this implemented in Elasticsearch, because the paradigm that we use is essentially: a document is a row, a DataFrame is an index, and fields in Elasticsearch are columns in a row. So any time that you deviate from that, it's not gonna be implementable, essentially. So iloc is a common one that gets asked about; transpose is the one that I use. Yeah, so anytime that you run into that, that's where you'll have to start using to_pandas. So here's an example of what Jay was actually just mentioning: filtering your data and not having to pull it local to do that filtering. So this is "give me a sample that has 10 documents" out of these almost 200,000 documents. If you just want a sample of your data, that's really quick: here's 10 documents right here with all the fields in them, and if you were to run it again you get another random sample. And then you can also do, if I just did grade... so for example, I only want ones that have a grade of B. This doesn't even pull it locally; I'm just applying this filter and running it, and it'll show you a little preview of the data that would come back. So all of these, for example, have B, and it cuts it off because that's how a
repr in DataFrames would look, but see, a lot fewer rows are being returned. And so if I were to run to_pandas there, that would give you that result. But then you can also combine these into, let's see if I remember this... Bronx, there we go. So if I put some parens, just so it's easier to parse mentally, I believe that it should work. So this will give you even fewer. And what this is doing in the background is essentially maintaining a query DSL, which is the query language that Elasticsearch uses. It's maintaining that, and it's also maintaining a task graph that has pre-processing, run the query, and then post-processing. And so every time you do something to your DataFrame, it adds on to those two values there, the task graph and the query DSL, and then when you actually say, "Okay, now I want my values," it'll execute that task graph, which includes making a search or an aggregation request to Elasticsearch and then post-processing all those results into pandas. And anytime that you do an aggregation, because the results are smaller, if that makes sense, you're not going to get millions of results in theory from an aggregation, you're just going to get the subset that you wanted because they're all collated together, anytime we do that we actually return you a pandas DataFrame, because we can, right? It's a lot safer to do that because your data is a lot smaller. But in this case this is actually an Eland DataFrame, but it has that immutability part of pandas that people like, where if you were to say, "Oh, I want to change my DataFrame, I want to assign this as DataFrame two," you could totally just do that, and then DataFrame two would forever have this filter applied to it. That's something that Eland also does. This goes back to what you were talking about earlier, actually, about combining the aspects of both so that they mix together
nicely in a way that makes sense. Immutability is another one of those things. So yeah, I've been rambling for a while, so go ahead. No, you're good. And I think with that we can even go into the idea of querying, because, again, we mentioned this earlier, there are two ideas when you use the word query when you're talking about DataFrames and Eland. You have the pandas version of query, which is a very Boolean-like effort, very similar to what we're doing now. In a sense you are querying: I want all of the rows where the grade is equal to B and the borough is the Bronx. You could turn that into df.query, what was it, grade equals B. I don't know why you would do that when you could just do this instead, but you can. But one of the things that I noticed there is you have a good example in those violations: everything starts with something like "violations", but what if it was just something like "there was a rat"? And it's like, all right, I want to know every restaurant where there was a rat sighting. You couldn't necessarily say "query rat in this row" without it being a super taxing effort, but you could do it with es_query, which is applying the power of Elasticsearch onto your DataFrames. Yeah, so I'll avoid embarrassing any businesses with rat sightings. So, to show off the full power: this is where Eland diverges from just being, "Okay, we're just gonna be a drop-in for pandas." We're gonna add some more as well that is Elasticsearch flavored. And so every time that you see an es prefix, that means that it's an additional thing that Eland is doing. So if you're looking for Elasticsearch-specific stuff, you can always start with the es prefix. And so this is essentially just adding on to, when I was talking about the query DSL before, it's adding this query on to that internal query DSL, which is just a raw Elasticsearch
query, but you can do a lot more with Elasticsearch than you can with pandas, for example geo-distance. There are geo-distance queries: you can set a point on a map and say, "I want every restaurant that's within 50 meters." This one is just pulling the cuisine description, but you can also pull business, and then you can just run it, and within 50 meters of that point, here are all the entries and the number of times that they show up in this log: KFC, Burger King, all of that. And then another one, which leans more into the whole full-text search capability. So, an example here: if you just search for tacos, you could also do this by asking, "Is the word tacos in the business field?" That's an easy thing to do: the word tacos is there, okay, fine. But Elasticsearch is smarter than that. It has a deeper understanding of languages. And so if you change the analyzer from the default one, which is usually very simple, it just removes white space and does that kind of stuff, you can have things like stemming. So the analyzer is English now, and now if I run it, it won't just do a simple "is the word tacos in the name," it'll also do stemming, so you get taco. So this is more of a data-plus-typos thing, and things that are similar and have the same stems, so verb tenses, all of that is built into it. And this feels pretty native, you know what I mean? This doesn't feel like a big deal where I have to construct this huge query. I'm just giving it a word and telling it that it's English and then calling to_pandas, and then I have a DataFrame that understands English, apparently. So it's really cool to expose those little tiny pieces of the Elasticsearch API through these native-feeling functions. We're trying to do that more and more, but the full-text search capabilities are really exciting. Absolutely. Well, let's
actually start wrapping up, because we've got about 10 minutes left. I want to make sure that we cover a few things. First of all, thank you so much, Seth, for showing this. We still couldn't get to machine learning. I feel bad; I did not distribute my time effectively. I guess my performance reviews will show that. But that said, I know we've got some folks in the machine learning camp at Elastic that are on the stream watching now, so be looking forward to a ping from me in Slack: "Hey, you want to come on to the stream and show us some of the machine learning stuff that you're doing?" Maybe we'll bring back Eland. But again, Seth, thank you so much for showing us all of this. Also, we've got a few more minutes left, so if people have questions about Eland, its capabilities, or about working with the clients at Elastic: I remind everybody our clients are open source, so I'm gonna throw it up again, github.com/elastic/eland. If you feel like you have a brilliant idea on how we can improve this, by all means check that out. But yeah, Seth, was there anything that we can't leave without talking about as we start to wrap up? I mean, I guess it's always fun to know what's coming next, right? I think there are a couple of big things that are really important for Eland that are coming next. One is we have our checklist for getting Eland to general availability. I know that there's been a couple of contributions working towards that, because right now it's in beta still. I would say it's a pretty stable beta; it's been in beta for a little while. And so getting over that hump of general availability and making sure we iron out all the problems that still exist, or all of the things that we haven't resolved, yeah, I think that's the big one. And then also keeping pace with all of the awesome things that are happening in the
machine learning space for Elasticsearch. So if you are planning on using machine learning for Elasticsearch, you can just watch Eland as well, and we'll probably be going at the same pace that Elasticsearch is. So we're really excited to integrate some more of the APIs there as they develop and as new ones are created. Absolutely. And I will say, as someone who's been beta testing Eland for basically the entire time I've been at Elastic, it's one of those things where if I'm uploading a lot of data from a CSV file, I'm using Eland. If I'm having to do a query and then make it presentable to people outside of my organization, where they won't have access to Kibana, I'm using Eland in a Jupyter notebook. It is one of those tools that's just so versatile and gives you the ability to surface the data that you're looking for across so many different applications. I use it for web data, I use it for just data, data, data, I guess, I don't know. I've used it for a lot of public data, and I've recently just got to start playing around with some of the geolocation stuff and figuring out, hey, if I want to implement a search where I'm searching around an item, using this in conjunction with something like Google's Geocoding API, you can say, "Show me all the taco joints around this area," and then get it back in a nice table format that you can then go through, and in this case even look at some of their reportings of a certain pizza rodent. But at the end of the day, you can come up with so many use cases to pepper Eland into, and I always learn something new every time we have a conversation around it. So that said, I think we're going to have to do another conversation around Eland in the future. Would you be down for that? Yeah, definitely. Yeah, have me on anytime. Awesome. Well, I'm gonna take the last couple of minutes to get some of the admin stuff out of the way.
Everybody, thank you so much for listening. If you have questions around Eland, you can ask Seth. I'm going to throw up Seth's Twitter handle there; I know he's pretty active on Twitter and pretty helpful as well. Also, you can ask them in our Discuss forum, where we have technicians, Elasticians, and all the other -icians that are out there wanting to help out. Again, I like making up words, it's fine. And of course, if you want to keep up with what's going on in the community, we do this weekly live stream every Tuesday, almost every Tuesday, currently every Tuesday. You can go to community.elastic.com and make sure you sign up for the AMER Virtual User Group, where it's not just this live stream; there are plenty of other events that happen all throughout the week. I think we have an upcoming one on Limitless XDR in the future, like in a couple of days. But next week we're going to be talking with Enrico Zimuel from the PHP client, which is awesome, because I know there are a lot of folks out there that are still rocking PHP, still rocking Laravel, and I want to make sure that we have some content for them as well. But again, Seth, thank you so much for being my awesome guest, and until next time, everybody, have a great week.
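The lazy filtering described in the demo, where each filter adds to an internal query DSL and nothing is fetched until results are requested, can be illustrated with a small standard-library sketch. This is not Eland's actual implementation: the LazyFrame class, its where and to_search_body methods, the index name "nyc-restaurants", and the column names grade, boro, and business are all assumptions made for illustration. Only the query shapes (term, match with an analyzer, bool with filter) are standard Elasticsearch query DSL.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Tuple

Query = Dict[str, Any]


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical stand-in for a lazily evaluated, Eland-style DataFrame."""

    index: str
    clauses: Tuple[Query, ...] = ()

    def where(self, column: str, value: Any) -> "LazyFrame":
        # Return a NEW frame with one more exact-match clause; self is unchanged.
        return LazyFrame(self.index, self.clauses + ({"term": {column: value}},))

    def es_query(self, raw: Query) -> "LazyFrame":
        # Append a raw query-DSL clause, loosely mirroring the "es" prefix idea.
        return LazyFrame(self.index, self.clauses + (raw,))

    def to_search_body(self) -> Query:
        # Only here are the accumulated clauses turned into one search body.
        if not self.clauses:
            return {"query": {"match_all": {}}}
        return {"query": {"bool": {"filter": list(self.clauses)}}}


# Assumed index and column names, chosen to echo the demo.
restaurants = LazyFrame("nyc-restaurants")
grade_b = restaurants.where("grade", "B")
bronx_b = grade_b.where("boro", "BRONX")
tacos = bronx_b.es_query(
    {"match": {"business": {"query": "tacos", "analyzer": "english"}}}
)

# Nothing has been sent anywhere; this just prints the body a search would use.
print(json.dumps(tacos.to_search_body(), indent=2))
```

Because every method returns a new frame, grade_b keeps its single filter even after bronx_b and tacos are derived from it, which is the immutability behavior Seth describes for assigning a filtered frame to a second variable.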
Info
Channel: Official Elastic Community
Views: 158
Rating: 5 out of 5
Id: w8RwRO8gI_s
Length: 56min 22sec (3382 seconds)
Published: Wed Sep 01 2021