Haystack LIVE! Natural Language Search for Solr and Elasticsearch

Captions
And I am recording. Welcome, everybody, to the Haystack LIVE meetup, and thank you all for coming. Today we're going to hear from my colleague Max Irwin on natural language search for Solr and Elasticsearch. It's going to be a great talk, and he's got some things he's created to show you as well, which is exciting. Just to add, we've got some of our OpenSource Connections training coming up soon. There's a Solr Think Like a Relevance Engineer course next week, and there are still some tickets left for that if you get in there quick. We have Learning to Rank coming in a few weeks, on the 20th of October, which teaches you how to use learning to rank in your own Solr or Elasticsearch. And we've just announced a course on natural language search too, which is linked to this talk, on the 17th of November, so do check out our website for more details. As usual, if you have questions, please drop them into the Zoom chat and I will moderate those when we get to the end of the talk; I'm sure Max would be happy to answer your questions. You've all very kindly muted your microphones, which is useful if there's any background noise, and it's up to you whether you want to put your video on or not. Without further ado, I shall hand over to Max for today's talk.

Thank you so much, Charlie. Just one more note: we are recording, if you didn't catch that already, and we upload these to YouTube, so if you have to drop out and can't catch the whole thing, you can come back and watch this recording or any of the other recordings that we have. Thanks so much for coming. The title of this talk is Natural Language Search for Solr and Elasticsearch. This is a pretty easy title; other alternate titles that I considered were Natural Language Blunderstanding, because it's so ridiculously hard, and Natural Language For Understanding, because who doesn't love NLP?

So, I'm Max Irwin, the guy talking to you right now; you can grab my contact information there. I'm going to talk about a lot of stuff. We have technically an hour and a half-ish; I don't know how much time this talk is going to take. It's going to be a talk and then a demo of some stuff. Just some background on how I got here: I've done software development for a while, did some search stuff, and I got into NLP around 2016-2017, when I started figuring out how to fix a bunch of search problems that were really annoying. Then I joined OSC. I met OSC at a Lucene Revolution, when it was still called Revolution; I met Eric and Doug back then, and I really liked what they were doing. Then they announced the first Haystack, so I gave a talk on some of the research I'd been doing about vocabulary extraction. The slides from that talk are available (there's no recording), it's still a popular talk, and I refer to it pretty often. That's really how my research started, because I was mostly interested in taxonomy development and extraction using NLP tooling. I continued the research after joining OSC, and I use NLP for clients here and there when I have the opportunity. It's always interesting to me, and I always try to solve search problems with it, because it's a very effective tool. Based on my experience, and basically the explosion of stuff that's been going on in the NLP space over the past several years, I developed a course; I gave it in pieces to a rather big client of ours, and we're offering it publicly now as well. This year I started working on a project that really renewed my interest in the talk I gave at the first Haystack, and a whole lot of other stuff; it led in an interesting direction, and that's how we all got here today. So thanks for coming, and let's get started.

We have a pretty ambitious agenda for this talk. What are the problems that we have as Solr and Elasticsearch users? There are people from both camps on the meeting today.
Please, no fighting; we're all friends here. Solr and Elasticsearch are both based on Lucene, as most people know, and there's a lot that goes on in Lucene. We've picked up a lot of habits over the years to try to do things in fancy ways, but we get stuck. So: what do we want our search stack to be able to do, and how do we express those wishes? There are existing solutions out there, but they're not really tied to search just yet. How can we take those and put them together in a way that complements what we're already doing, without breaking a whole bunch of stuff at the same time?

The big problem statement is that full text search fails for many use cases, even when tuned. What do I mean by that? I like to picture Lucene as this amazing math nerd who's been handed a whole bunch of books and says: I didn't pay attention in English class, I got my PhD in computer science and math, so don't bother me with these text things; I want to parse strings and numbers. If you send a query to Lucene, say "shirt without stripes", Lucene will just do matching and weighting, with complex mathematics and data structures. It doesn't really understand anything about the query, and that's not a surprise; everybody knows that already. But how do we solve this problem? Because this is a common thing people search for. Maybe not this particular query, but people want to express themselves even in these short fragments of sentences; not just the keywords, but with a little hint of "no, this is what I want". So how do we solve those problems?

There are a whole bunch of other problem statements, and I'm going to spend a little time talking about the problems we have, because it's important to express these things; we all live with them. You install Solr or Elasticsearch, you index some content, you start searching, and it stinks. You've got to spend a whole bunch of time fixing problems, and eventually you mature, and it turns into this big configuration mess with a whole bunch of stuff going on. You really have to spend a lot of time to fix the seemingly trivial problems I mentioned before, because Lucene doesn't grok language.

The other problem is that, as engineers, there's all this obscure knowledge scattered around the internet and in people's brains. We go to talks, we look on Stack Overflow, we dive into the documentation, we read the actual source code on GitHub, and we try to piece it all together to solve our problems. And then there's this other camp: all these NLP libraries out there. There's Hugging Face, and there's the Explosion AI team with spaCy, and they're doing all this cool stuff, but it's low-level from our perspective, because in search we spend a lot of time doing configuration. We don't spend much time actually writing code in the search engine layer (we do in the service layer), so it's not trivial how to put these things together. And of course you can give your money to Amazon or Google and say, OK, I'll just send my query to you, give me back something good. But it's expensive. Do you know if it's working? If something goes wrong, how do you debug it? Are you willing to give up that control? At OSC our mission is to empower search teams: we talk about making sure people have tools they control and understand, so they can solve the problems they have for their customers. So a lot of teams end up building or buying all this stuff to go along with a stock Solr or Elasticsearch system: you need your service layer, and you've got to cobble together some auto-suggest stuff.
You've got to write all your query logic, do your content enrichment and text pre-processing, and worry about your taxonomies and your synonyms, all these things. So many teams end up either building these things themselves or buying them one at a time, cobbling together something custom. And then, unfortunately, there are a lot of teams out there that don't really have the time or the budget to do that. They slap a search bar on their page, stick some content into a Lucene-backed engine, and say: OK, we've got search working, we're getting ten blue links, we've got some facets, next. But then their search is terrible and they don't have time to spend on improving it. That's a huge problem you see in the field.

Now we get into the technical problems. The Lucene normalization and analysis chains: if you want to really deal with text, you kind of just can't. You do your best, and these tools are very fast; the analyzers are easily configurable and you can do a whole bunch of stuff with them in the pipeline. But there's something missing, and everybody knows there's something missing, so you do all these workarounds: oh, don't stem that term; don't split on this hyphenation, it's a compound word; and all kinds of other funny and interesting stuff we deal with day to day. If we want to get advanced and write some plugins, you can have OpenNLP inside of Solr, and you can stand up an NLP-based plugin in Solr, but then every time you re-index you've got this really heavy thing that just slows your re-indexing down. It's a double-edged sword: you can't fix the problems in your search engine, but you want to. The other problem is that all the open source NLP research is really advancing in the Python world; it's not advancing in Java. There are Java tools out there, and they're good, but if you look at everything that's happening and all the activity, it's pretty much Python-based, because of the tools Python affords and all the deep learning frameworks supporting the Python community. So these are the tech problems we've got to deal with.

Let's look at a typical analysis pipeline. You've got some content. It goes to your HTML strip char filter. Maybe you want a pattern replace char filter: find some currency, say, and fix the dollar sign or the decimal point, and now you've got this whole bag of regular expressions. Then you've got to tokenize, and you've got to make sure tokenization is doing what it's supposed to do in the right places. Then you've got synonym expansion, and what if your synonyms are picking up the wrong things in certain places? Now you've got too much recall, and you've got to deal with that. Stop word removal is a huge pet peeve of mine; I just hate removing stop words. You're just throwing away stuff in your text, which is bizarre to me, yet it still happens, and we still have to do it in certain cases. Lowercasing: ain't no big deal, but that's a funny graphic. And stemming: everybody knows the pitfalls of stemming; it's just horrible, you get things that end up matching that really shouldn't. So that's what an analysis pipeline looks like these days: open up most Elasticsearch or Solr configs and this is what you'll see for a text field. So, what if? There's a whole bunch of ideas here: what is semantic markup awareness, what is statistical pattern recognition, what is co-reference resolution? What if you have HTML, and you don't want to just throw the HTML out?
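As a point of reference for those what-ifs, the classic chain just described can be caricatured in a few lines of Python. This is my own toy stand-in, not Lucene's actual code; it just makes the lossiness concrete: the bold markup is discarded, stop words vanish, and the suffix-chopping stemmer mangles words.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "over"}

def strip_html(text):
    # crude stand-in for an HTML strip char filter: drop tags, lose the markup
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    # lowercase + split on non-alphanumerics, like a simple analyzer
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    # toy suffix chopping -- the kind of rule that turns "hiking" into "hik"
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    return [stem(t) for t in tokenize(strip_html(text)) if t not in STOPWORDS]

print(analyze("<b>Hiking</b> over the lazy dog"))  # ['hik', 'lazy', 'dog']
```

Note everything the output has forgotten: that "Hiking" was emphasized, that "over" carried a relationship, and what word "hik" ever was.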
What if you see that a term is in bold in the markup? That's important; should we just chuck it out? What if we had a way to capture that and say: this is a bold term, so we're going to put some weight behind it? What if, instead of a whole bag of regexes in our pipeline, we could do more advanced pattern recognition with statistics? There's a tool called Duckling, now maintained by the Facebook team but originally from the Wit.ai team. It's a Haskell service, and it does a really good job of finding patterns; it is a bunch of regexes, but mixed with some naive Bayes classification and other interesting things to get better accuracy out of it. Co-reference resolution is an open problem; I'll talk about that a little more, but it can really help in a lot of use cases. Tokenization: we're typically just splitting terms on whitespace, maybe some punctuation and a few other things. What if we could understand what sentences and words look like, how they relate to each other, and what their context is, and use that for further things? Synonyms: what if we had a knowledge graph, and we wanted to extract entities and use those in our search? And then, what if we wanted to classify a document? You can use the term vector classification tools in Solr and Elasticsearch; they're out there, but they work off the result of all the normalization you've already done in your typical analysis pipeline. You can do some basic classification that way, but what if you want to do really advanced stuff based on transformers, or write your own deep learning model with something like Gensim, which is a great Python library for these things?

Stop word removal: I hate removing stop words. What if you could just weight terms instead? The term "the" is not worth much, so don't throw it away; just say it's not important. If you get a match, you're not going to boost on it, and maybe you'll even de-boost on it. What if you don't have to stem anymore; what if you do lemmatization instead? We'll talk about that in a minute. And lowercasing, you can do that, blah blah blah, what? So I didn't finish this slide, apparently; that's funny.

You have all these text problems you want to solve. If somebody searches for "1992 supreme court decisions", are you just going to match the text, or are you going to understand: I should filter on 1992; how do I do that? There are a lot of other examples where you have these short, not-quite-sentence queries, natural language queries, that you want to deal with.

Stemming again: "Maine" gets chopped to m-a-i-n, so if you search for "main" you're going to get a hit on Maine. And "hiking": you're just chopping off the -ing. With lemmatization, you're actually using a vocabulary to look things up, and this is available in most NLP libraries. I'm going to talk a lot about spaCy; I use spaCy a lot, and spaCy will lemmatize nouns and verbs and all kinds of stuff for you, and it works really well and it's really fast. If you do this in the index, though: at my previous company we had a huge Solr index where we did lemmatization, and it was really slow, because we had this gigantic dictionary to maintain, and we'd have to parse everything as it came through the analysis chain, and it would just bog down when we were doing a re-index. It would have been nice not to have that problem; I'll talk about how we're going to solve it a little later.
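The stemming-versus-lemmatization difference can be sketched without any heavy machinery. Below, a toy suffix stemmer sits next to a toy dictionary lemmatizer; the tiny lookup table is my own illustrative stand-in for the real vocabulary a lemmatizer like spaCy's consults.

```python
def stem(token):
    # naive suffix chopping: no vocabulary, just string surgery
    for suffix in ("ing", "e", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# a lemmatizer looks words up in a vocabulary instead of chopping;
# this dict is a toy stand-in for a real lemma dictionary
LEMMAS = {"hiking": "hike", "decisions": "decision", "maine": "maine"}

def lemmatize(token):
    return LEMMAS.get(token, token)

print(stem("maine"), stem("hiking"))            # main hik
print(lemmatize("maine"), lemmatize("hiking"))  # maine hike
```

The stemmer happily collides the state of Maine with the adjective "main", while the lemmatizer knows "maine" is already its own lemma.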
All right, co-reference resolution. This is a big one; I'm going to give two cases for co-reference resolution as a tool. The first case: a main area of my research is vocabulary extraction, so I want to see the relationships between terms. I've got this sentence: "When choosing to install Solr there are several options. It can be downloaded from the Apache website or the Docker hub." So, what is "it"? There's a lot of meaning behind that term, but I don't have a way to hook onto it. I want predicates that are related to the noun Solr as part of some vocabulary extraction work; I want to see how terms relate to each other. From the raw text I only get the first relationship, install to Solr. But if I do co-reference resolution, I resolve the pronoun and replace it with the noun, so "Solr" is substituted in place, and now I can do further work on it: I've got this additional relationship, Solr to download. That's really interesting.

Another really interesting byproduct of co-reference resolution involves BM25. BM25 is meant to measure the aboutness of terms in documents: you want to know how important this term is in a document compared to all the other documents, and get the ones with the highest score up top. So let's say we've got these two documents. Document one: "Some search teams use Solr, others use Elasticsearch or Vespa." Document two is the one from before, the instructions to install Solr, yada yada. If I query "solr" here, the first document comes up on top, because BM25 says so. I can see that the saturation is much lower on the second document: it's a much longer passage, and the term frequency is the same, but the second document is about Solr. The first one merely mentions Solr, alongside a bunch of others; the aboutness really lives in the second document. Now, if we do co-reference resolution, that fixes this BM25 issue: the term frequency becomes two.
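The score shift can be sketched numerically with the term-frequency saturation component of BM25. The document lengths and parameters below are made up for illustration; the point is only that resolving "it" to "solr" lifts the longer, more "about" document past the short mention.

```python
def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    # the term-frequency saturation part of the BM25 formula
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# doc1: a short sentence mentioning "solr" once among other engines
# doc2: a longer passage that is *about* Solr, but whose second mention
#       hides behind the pronoun "it", so tf is 1 until coreference runs
doc1        = bm25_tf(tf=1, doc_len=9,  avg_len=15)
doc2_before = bm25_tf(tf=1, doc_len=21, avg_len=15)
doc2_after  = bm25_tf(tf=2, doc_len=21, avg_len=15)  # "it" -> "solr"

print(round(doc1, 3), round(doc2_before, 3), round(doc2_after, 3))
print(doc2_after > doc1 > doc2_before)  # True
```

Before resolution, length normalization punishes the longer document; after resolution, the doubled term frequency wins out, which matches the intuition about aboutness.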
We've got higher saturation, and now the second document comes to the top, as we want. So that's a really interesting tool. Unfortunately, the accuracy isn't amazing; I believe the leaderboard right now is about 79 F1, from a paper maybe six months to a year ago. But people are working on this a lot, and there are some tools that do an okay-enough job that you can fix a lot of annoying issues.

Another tool we have is payloading for linguistic weights. This is not an original idea; it's pretty common, and I know a lot of teams who at least discuss it, whether or not they do it. There are a lot of examples out there, mostly in Solr; I haven't come across too many Elasticsearch examples. The idea is: if I'm searching for chocolate, how do I identify chocolate? I've got a first document, "Nestle chocolate milk", and a second, "Nestle milk chocolate". These are equivalent under BM25: if I search for chocolate, I get the same score for both. But the latter is chocolate, and I'm searching for chocolate, not milk. This is a concept called the head noun. So we payload a weight based on a term's semantic importance in the text: "chocolate" in "milk chocolate" gets a 2.0 because it's the head noun, so I give it a little more juice, while "chocolate" in "chocolate milk" gets a 1.5 because it's a modifier, so I don't give it as much. Now the document that actually is chocolate, not milk, gets the higher score.
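In a real pipeline the head noun comes out of spaCy's dependency parse; as a toy illustration of the weighting itself, the heuristic that the last noun of an English noun phrase is its head is enough (the weights 2.0 and 1.5 are the ones from the example above).

```python
# head-noun payloading sketch: "milk chocolate" is chocolate,
# "chocolate milk" is milk -- the last token of the phrase is the head
HEAD_WEIGHT, MODIFIER_WEIGHT = 2.0, 1.5

def payload_noun_phrase(phrase):
    tokens = phrase.lower().split()
    return [
        (tok, HEAD_WEIGHT if i == len(tokens) - 1 else MODIFIER_WEIGHT)
        for i, tok in enumerate(tokens)
    ]

print(payload_noun_phrase("Nestle chocolate milk"))
# [('nestle', 1.5), ('chocolate', 1.5), ('milk', 2.0)]
print(payload_noun_phrase("Nestle milk chocolate"))
# [('nestle', 1.5), ('milk', 1.5), ('chocolate', 2.0)]
```

With those payloads indexed, a payload-aware query for "chocolate" scores the second document higher, exactly the behavior plain BM25 could not deliver.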
Next, named entity recognition. I'm not going to spend a whole bunch of time on this; it's covered extensively. There's a whole bunch of things we could do with NER, and a lot of people want to do it, but there are a lot of problems with it. At the very least, it's pretty easy to identify noun phrases, people, and locations. You may not be able to link them to a knowledge graph or a database successfully without a lot of manual work and cleanup, but at the very least you can say: this is an entity, it's important, and it's important in a certain way. I'll show some examples later, but basically you can say: give me all the people from a document and stick them in a people field, so I can facet on people or filter on people, which is a common need. Same for locations.

Text converted to numbers: if I spell out "twelve", and I've also got the digits, say "$12", those are the same thing, and I want to find them the same way. That's a recall problem, and it's pretty important.

Knowledge graph extraction: this graph is an example, and I'll show a live demo of it later, based on a tool I originally wrote for that 2018 talk, called SkipChunk. I've had the opportunity to refine it over the last couple of months, and it's coming along pretty well. It does some really interesting things; we'll see a bit of a demo later, but basically it finds subject-predicate-object triples in unstructured text and stores the relationships in a search engine. It doesn't use a graph database; it just uses Solr or Elasticsearch, whichever you want. Then you can navigate this graph, and you can see term weighting, relationship weighting, and very interesting things that give your language more glue.

Query intent and rewriting: "shirt without stripes". If I search for this, maybe I'll have a rule set I want to follow. I've got some prepositions, and some rules that fire on those prepositions: if I see "without" followed by a noun, I say: I don't want that, filter it out, get rid of it. So I rewrite the query to say: I'm searching for a shirt, and I do not want stripes as the style of the shirt. This should be a flexible tool for you to use.
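A minimal version of that preposition rule can be sketched as plain Python. This is my own toy rule set, not a real product's: a negating preposition turns the following term into an exclusion, where a serious implementation would confirm the next token is a noun via spaCy's POS tags rather than taking it blindly.

```python
# toy query-intent rewrite: "without X" becomes an exclusion on X
NEGATING_PREPOSITIONS = {"without", "minus", "excluding"}

def rewrite(query):
    must, must_not = [], []
    negate = False
    for term in query.lower().split():
        if term in NEGATING_PREPOSITIONS:
            negate = True        # the next term is unwanted
        elif negate:
            must_not.append(term)
            negate = False
        else:
            must.append(term)
    return {"must": must, "must_not": must_not}

print(rewrite("shirt without stripes"))
# {'must': ['shirt'], 'must_not': ['stripes']}
```

The output maps naturally onto a boolean query: `must` clauses for the wanted terms and `must_not` clauses for the excluded ones.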
It would be really nice if you could do this in a really easy way, but it's not easy right now; it's hard. So this is a dream analysis pipeline. I've just thrown in a bunch of examples, and there are a whole lot of other things you might want to do. And I'm going to say it: BERT. I'm going to say the BERT word. You could do the BERT thing: vectorize your whole text, stick it into a vector field, vectorize your query, and do approximate nearest neighbor if you wanted. A lot of people are trying to do this. It's hard, and there are a lot of problems with it; we'll see how it plays out, but a lot of smart people are working on the problem. Google and Bing and other folks have plenty of money, so they're throwing it at this, and they're having a lot of success. But if you don't have a full-time data science team who can do fine-tuning specific to your domain, plus a full-time DevOps team to worry about model GPU performance and where you're going to store all those juicy vectors, you're going to have some trouble if it's you and three other folks trying to engineer a customer-driven solution.

All of those things I mentioned: you want to do them, but if you do them in the Solr or Elasticsearch analysis chain, you're just going to slow everything down. It'll be: oh my god, I have to re-index, please no. Maybe I just want to change the rating for my product; I don't want to re-index and deal with everything. I want to keep attributes up to date and change stuff, but I don't want to re-index every single time I make a tiny little change, because it will take forever. So I want to do enrichment outside of the search engine and keep it somewhere, so that when I do re-index, I only have some very fast things to worry about: Solr or Elasticsearch tokenizing, lowercasing, and if I've got payloads in there, delimiting, which is pretty quick. Maybe a few other things too, but all the big stuff, NER and co-reference resolution and all that fancy processing: you do not want that in your field type. That will cause you pain.

There are solutions out there, but we are not academics; we are practitioners. Anserini is amazing, but ask me how long it would take to get a robust Anserini solution in place. If I just want some quick tools, I don't want to read an academic paper to figure out how to connect them to SolrCloud. So what do we want as practitioners? We want practical tools, so we can quickly ship better search quality. We already have a lot of tools: things like Quepid, the Rated Ranking Evaluator, a bunch of stuff we use to measure search. Now we want to improve it, and most of the improvement happens either in your custom service layer or in the analysis chain. I'm so tired of going to tool websites that claim they'll automatically fix my relevance problems for me. You have no idea what my relevance problems are; please don't assume you can fix them. If I go to your fancy cognitive search tool, I know you're not telling the truth when you say cognitive intelligent search will solve all my problems. It won't; I'll still have to spend a lot of time, and now I've got a black box to deal with, and that's trouble. I want something that is a natural extension of my existing search engine. I want something that works with Solr and Elasticsearch, works with the other toolsets I'm already using, is non-intrusive, gives me additional signals and analyses I can use if I want to, forces nothing on me, and is flexible. So, I have been working pretty hard over the last couple of months.
I've been developing a service called Hello NLP. It's an open-source search service; it's an example service, not the be-all and end-all, but it embodies a lot of what I've been talking about, where it's been pretty tricky to marry these technologies together in a nice way. Again, these are the things that drove my dream service: if I were going to write something from scratch, this is what I would do, based on working with a lot of search teams and seeing the same problems all over the place. I pressed the public button on GitHub this morning, so now it's public; you can go to the link here, and I'm going to give you a demo right now. It gives you out-of-the-box NLP-based auto-suggest. It gives you pipelines for NLP whose configuration looks very similar to your Elasticsearch or Solr pipeline, so you should be very comfortable with it if you're used to configuring those tools. And it does query enrichment: you pass in your Elasticsearch DSL or your Solr query string, and it enriches it the way you tell it to. So it's just like writing a Solr or Elasticsearch query like you're used to; there's no special magical new query language to learn. There are a couple of decorators I've added that can help, but you don't have to use them.

OK, so now I would like to offer this USB stick to the demo gods. Please, demo gods, be kind to me while I give a demo. I'm not going to start with the tool itself; I'm going to start with Splainer. If you're not familiar with Splainer, it's an open source tool that works with Solr and Elasticsearch; it lets you craft a query (this one happens to be in Solr), see the scoring and how the scoring happens, and tune. So I've got this query, "search quality". It's an edismax query; I've got some query fields with some boosting, and I've got some payloads. This is interesting: what's happening here is that I have actually payloaded some linguistic weights, and I can hook into that with a Solr subquery. Nothing special here, this is out of the box; but I've enriched my content with Hello NLP beforehand, so the title field and the content field are payloaded, and I can use the payloads to boost. This is a common example; you can grab this stuff off the internet. So let me change this to use the payloads and rerun my query, and now I can see what's really happening. I have this as a stored field, so I can look at what's being payloaded: all of the fancy weights have been added to my content, and I can hook into that. That's nice.

So how does this actually work? How do I actually get this content in here? Another thing, if you notice this query: I'm not pointing it at Solr. You might not notice, because this looks like a Solr URL, but it's actually a Hello NLP URL that I'm hosting on my local machine. OK, so this is Hello NLP. This is the Solr configuration for Hello NLP, and I also have an Elasticsearch configuration right here. This port, 5055, is running against an Elasticsearch Docker container that I have, and port 5050 is running against Solr; I think my Elasticsearch is version 7.9 and my Solr is 8.4. On the right you see this interesting thing, the pipeline. The pipeline is actually the same for both; I have them both configured the same way. I've got a plugins path: I've written some Python plugins I can add, and it's very easy to write plugins for this thing.
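As an aside, the payload-boosting subquery shown in Splainer a moment ago is plain stock Solr. A sketch of the general shape follows; the field name `title_payloads` is hypothetical, and the `payload()` function's exact signature varies by Solr version, so treat this as illustrative rather than copy-paste ready:

```text
q={!edismax qf="title content" v="search quality"}
boost=payload(title_payloads,quality)
```

The `boost` parameter multiplies the edismax score by the payload value stored for the term, which is how the linguistic weights indexed by the enrichment step reach the ranking.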
And I've got some analyzers. So I've got a payload analyzer, and it looks awfully familiar: this is an html_strip, this is a tokenize, and this is a payload. But they aren't what you think they are. The html_strip is actually using the lxml library in Python to parse the HTML, not just throw it out. I spent a little time looking at the HTMLStripCharFilter in Lucene, and I got into some interesting code: it's actually a JFlex grammar written specifically for stripping HTML across a wide variety of use cases. I took all its tests and ran them against lxml; some worked better and some didn't work as well, but it produced some great results and it's really fast. lxml is, I think, C-based, wrapped by Python, and it's pretty straightforward to use. So I can parse the XML or the HTML and do the things I was talking about before: I don't have to throw it away, I can look into the bold tagging, do other things with it, and add weighting to terms. The tokenize is not a whitespace tokenizer or a standard tokenizer: this is spaCy, a spaCy pipeline. If you're not familiar with spaCy, it's an NLP library written in Python (actually Cython); you pass text into it and you get cool stuff back. That graphic you saw before, of the relationships between terms, was a spaCy tokenize. The payload is basically looking at the dependency parse that spaCy gave me and attaching weights; it does what the payload delimiter filter factory expects, sends the result to a field I've configured in my schema, marks it up, and sends it in. I've got some other analyzers too. There's the entitizer, which does entity recognition and actually pulls people out of text. There's the lemmatizer, so you can lemmatize stuff, which is pretty straightforward. And there's the prepositionizer, which covers that shirt-without-stripes example; we could do that.
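Going back to the markup-aware html_strip for a second: the idea of keeping bold terms and weighting them up can be sketched with just the standard library's HTML parser. The real analyzer uses lxml; this stdlib version is only my illustration of the principle, with a made-up weight of 2.0 for emphasized terms.

```python
from html.parser import HTMLParser

class BoldAwareStripper(HTMLParser):
    """Strip the markup but keep a weight for terms that appeared in bold."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._bold_depth = 0  # handles nested <b><strong>...</strong></b>

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self._bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong"):
            self._bold_depth -= 1

    def handle_data(self, data):
        for tok in data.split():
            self.tokens.append((tok.lower(), 2.0 if self._bold_depth else 1.0))

p = BoldAwareStripper()
p.feed("Install <b>Solr</b> from the Apache website")
print(p.tokens[:2])  # [('install', 1.0), ('solr', 2.0)]
```

Instead of discarding the emphasis, the analyzer emits (token, weight) pairs that can flow straight into a delimited-payload field.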
We can do that. And then I've got fields. Say I want to enrich a document: I send a document in, it takes these fields and says, "title, I'm going to run you through the payload analyzer and copy you to title_payloads." I debated whether to use copyFields for this API, but this seemed pretty natural. You can repeat a field if you want, so content goes to both content_payloads and the people field: it's payloading the content, and it's also pulling out the people and sticking them into this people_ss field, which is a multi-valued string in the Solr config. And I've got a query pipeline, a basic query enrichment, which is just lemmatizing. So I have all these analyzers, which is nice, but what are they actually doing? Another big thing for me, again, is that I want to be able to use the tools I already have access to, as if I were a normal relevance engineer. In Solr you can use the Analysis UI in the Solr admin; for Elastic there's the analyze API, which we will visualize, but you can also get the JSON back, because Elasticsearch will happily analyze something for you and send it back. So let's say I've got this text: "the quick brown fox jumped over the lazy dog." I've got my analyzer, I send the query, and I get the result back; I'll make this a little smaller so you can see the whole parse. Here's the result: "the quick brown fox jumped over the lazy dog," with weighting behind it. You can see fox is 4 because it's the head noun, and dog is an object, so the head noun gets a 4 and dog gets a 3.5. The other really important thing we'll have to deal with is how long this stuff takes. I challenge you to figure out exactly how long each pipeline stage takes in Lucene, unless you really dig in and find out. I wanted to be able to know very quickly where the bottleneck was in what I'm doing.
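The field-routing idea can be made concrete with a toy sketch. This is not the actual Hello NLP implementation; the config shape and target names like `title_payloads` are illustrative. It runs each configured analyzer over a source field and writes the result to a target field, copyField-style:

```python
# Hypothetical pipeline config: each source field lists the
# analyzers to run and the target field for each result.
PIPELINE = {
    "title":   [("payloader", "title_payloads")],
    "content": [("payloader", "content_payloads"),
                ("entitizer", "people_ss")],
}

def enrich(doc, analyzers):
    """Run every configured analyzer over the raw field value and
    write its output to the target field, leaving the source intact."""
    out = dict(doc)
    for field, routes in PIPELINE.items():
        if field not in doc:
            continue
        for analyzer_name, target in routes:
            out[target] = analyzers[analyzer_name](doc[field])
    return out
```

Any callable can stand in for an analyzer here, which is the point: the same mechanism serves a payloader, an entitizer, or anything else you plug in.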
It turns out that tokenize is actually quite expensive in this case: it takes 42 milliseconds to tokenize that text. spaCy works well in that it will cache things for you, so if I run this again it might be a little faster, and it was: now it's 12 milliseconds. So now I've got this parse and I can use it. Next I want a payload, and the payload just drops the weighting in. I've got some rule sets, and now I'm going to show you some code, which is really scary. Where's my screen... okay, so this is my payloader. I've just defined some weights based on open classes; if you're not used to NLP stuff, check out Universal Dependencies, which will tell you a lot about parts of speech and things like that. It's kind of a rough website, but the information is there, you've just got to dig for it. So I want to assign certain things in certain ways, and I bump my part-of-speech tags like this: if I see a verb, it gets a 2; if I see a noun, it gets a 2; a proper noun will get a 2.
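Those weighting rules are easy to sketch in plain Python. This toy payloader takes pre-parsed (token, POS tag, dependency label) tuples in place of a real spaCy parse, and the numbers mirror the talk: open-class words get 2, the dependency root gets an extra 2, and objects get a smaller bonus. The output uses Solr's delimited-payload `term|weight` syntax:

```python
# Toy payloader over pre-parsed tokens, standing in for the spaCy
# pipeline; weights are the illustrative values from the talk.
POS_WEIGHTS = {"NOUN": 2.0, "VERB": 2.0, "PROPN": 2.0, "ADJ": 1.5}
DEP_BONUS = {"ROOT": 2.0, "dobj": 1.5, "pobj": 1.5}

def payload(tokens):
    """Emit Solr delimited-payload syntax: term|weight."""
    out = []
    for text, pos, dep in tokens:
        w = POS_WEIGHTS.get(pos, 1.0) + DEP_BONUS.get(dep, 0.0)
        out.append(f"{text}|{w}")
    return " ".join(out)
```

With fox as a root noun and dog as a prepositional object, this reproduces the fox = 4, dog = 3.5 weighting from the demo.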
If I see the root, which is the head, I'll add another 2. So now I see all this stuff and I'm going to give some juice to my tokens, and I just go through this. There's all this other code, but it's not that much, just a little bit, and this one comes with it; I'm not pitching it as a plug-in, it's just a standard analyzer. So now we have this. Let's look at some other things. I mentioned the entitizer before, so now let's say I type "Charlie Hall introduced Max Irwin for the talk" and submit the query... oh, sweet, I thought it died for a second. Here's the result: Charlie Hall, Max Irwin, boom. This is spaCy NER; it's pretty straightforward to do and it doesn't take very long, that was seven milliseconds. Now I've got this and I can do whatever I want with it. If we go back to our pipeline, I can see that for content I want to run the entitizer and stick the output into the people_ss field. That's the Solr side, but it also works with Elastic, so I can do the same thing: this is my Elastic instance, I'm pointed at Elastic, and the analyze isn't touching the search server, it's just running the same stuff. But I can send content to this and it will index it for me: it'll run it through the pipeline and do it. So there's a pretty extensive API for this, and it's all here; it's FastAPI. This whole thing is a Python service; I'm running it with Uvicorn, but there's Docker too. That's the other important thing I wanted to mention: the tooling matters, in that it matches the way we actually use tools. If I start with something low level, I have to worry about the service layer, and deployment, and all that stuff, so I want to make sure I understand the performance and that I can deploy this thing in a way I would actually use. So this is a Docker container.
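The entitizer step above is spaCy's statistical NER, but the shape of the enrichment step can be mimicked with a trivial heuristic, shown here purely as a stand-in (real NER is a trained model, not a capitalization check):

```python
# Toy stand-in for the spaCy NER "entitizer": pull out runs of two or
# more capitalized tokens as person-like entities bound for a
# multi-valued field such as people_ss. Illustration only.
def entitize(text):
    people, run = [], []
    for tok in text.split():
        if tok[:1].isupper() and tok[1:].islower():
            run.append(tok)
        else:
            if len(run) >= 2:
                people.append(" ".join(run))
            run = []
    if len(run) >= 2:
        people.append(" ".join(run))
    return people
```

On the demo sentence this pulls out the two names, which is exactly the list the pipeline would copy into the people field.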
If I want to set it up in Kubernetes and deploy a whole bunch of these things to get really fast throughput, say ten re-indexing containers and five search containers, that should be easy to do; you shouldn't have to write that stuff from scratch. That was another goal of this project: make it really easy to deploy. Now I'm going to talk about the graph I mentioned before. We have this autocomplete; it doesn't go through an analyzer yet, it's in skipchunk, which is a command-line library right now, but the connectivity to Solr and Elastic works the same, that part is shared. I've indexed the OSC blog, so if I start typing "search" I get interesting results, and what you'll notice is that these are noun phrases and verb phrases and things like that. You're not going to get all the garbage you'd get from configuring an auto-suggest, sorry, a suggester, based off the normalized text you'd produce in a pipeline. You could do n-gramming and all kinds of fancy things in Solr or Elastic, but you're going to pick up a lot of junk, because that doesn't isolate the importance of the words. With spaCy we know which words are nouns and verbs and adjectives and adverbs and a whole bunch of other things, and how they relate to each other, so we can use that to pull out concepts, put some stats behind them, and then relate them. The other thing is that as we parse sentences, I can say, well, "users," so I'll click on users, and I'll zoom in a little here, but this is the relationship graph you're seeing: I've got users, search queries, and search terms, and users have everything, apparently. These are the relationships in my text. This is a latent knowledge graph that I've pulled out and I'm expressing here, and it runs off a very simple schema that I've created, again for both Solr and Elastic.
It'll work in either, and it stores this information so you can use it in a number of ways. I guess the ultimate goal would be to be able to edit these relationships, because you're going to get a lot of noise and things you're not interested in; text is very, very messy. I'd also like the opportunity to add synonyms through some kind of UI, because right now the text is just the text. It's not doing anything magical, no fancy embeddings to relate terms to each other, because that doesn't work too well in a lot of cases. It's just raw text, parsed with spaCy. You want to be able to say that "learning to rank" is the same thing as "LTR": that's a synonym, a different label for the same concept, and I want to relate the two together. That's a really far-out stretch goal for me, to do something like that and then use it when searching. If I search for something and I want to autocomplete, and it doesn't do this yet, but it wouldn't take long to wire up: I type "users" and press space, and if I don't select one of these suggestions it gives me other nouns that appear close by, other nouns in the same sentences, or nouns connected by the same verbs. It's actually pretty easy to do based on the data structure I already have; I just haven't spent the time to write the UI, though the API calls for it exist. I'd really like to do that, because people type and they want suggestions of things that are possible. You don't want to suggest something that will result in zero documents; you want to suggest things that will give them the most relevant content. Okay, let's get into some deeper areas.
So, I have a couple of basic examples. I've already talked about payload boosting, so now I'm going to talk about query rewriting. If I open this Splainer link you're going to see something very interesting: I actually have two filter queries here in Solr. This is a filter query and this is a subquery, with a very basic syntax; it's pretty minimal at this point, but it's expressive, and it lets me say, well, I have this analyzer called a prepositionizer (I'm not great at naming things), and it looks like this. So I'm going to go back and look at some more Python, apologies for the code. I've got a couple of plugins that I've written, so let's look at the prepositionizer. This is just a very simple Python plugin: you can take it, copy it, paste it, stick it in the plugins directory, and boom, you've got a plugin you can use and craft any rules you want. In this case I'm doing something very simple. I have this method, propose, and I pull out all the prepositions and the prepositional objects: in the query "shirt without stripes," "without" is the prep and the pobj is "stripes." So I say, give me those, and if "without" is in the preps, rewrite the query to get rid of those preps and pobjs. Now I'm left with just "shirt," and the value I want to stick in my template is the v. If I go back here you can see I have this dollar v, and this looks like a Solr query; it's just a very simple query pre-parser that does this for you, but the syntax is very familiar, and I actually stole these parameters from the payloader: you have f and you have v, v is dollar q, which references this, and then I've got this filter saying I don't want to see stripes. So this is going to be the query "relevance without controlling."
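The prepositionizer logic can be sketched like this. The dependency labels are supplied by hand here in place of the spaCy parse, and the output shape (a positive q plus a negative filter) is illustrative rather than Hello NLP's exact template syntax:

```python
# Sketch of the prepositionizer rule on pre-tagged (token, dep)
# pairs: "shirt without stripes" keeps "shirt" as the positive
# query and turns the prepositional object into a negative filter.
def rewrite(tokens):
    preps = {t for t, dep in tokens if dep == "prep"}
    pobjs = [t for t, dep in tokens if dep == "pobj"]
    keep = [t for t, dep in tokens if dep not in ("prep", "pobj")]
    if "without" in preps:
        return {"q": " ".join(keep),
                "fq": " ".join(f"-{t}" for t in pobjs)}
    # Any other preposition: leave the query untouched.
    return {"q": " ".join(t for t, _ in tokens), "fq": ""}
```

The real plugin hands the rewritten value back through the `$v` template parameter; this sketch just returns the two pieces directly.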
If I remove this, I have this thing at the top: "Solr versus Elasticsearch for relevancy," matching on "controlling." I don't want that; stop telling me about controlling stuff. And now I get this document instead, because now I'm actively filtering that out. Some more things to note, and I'm jumping around a little here, but there are a lot of interesting things you can do: you can do analysis, you can enrich a document, you can index a document, and you can re-index a document. If I call enrich, it's not going to store the document on disk for me; it just gives the document back in the enriched form I've specified in my pipeline. If I say index, it actually stores the document on disk and keeps it there. And, you know, there's no CAP theorem here: if I've got ten systems in place all doing this stuff, I have not gone to the point of implementing the Raft protocol to make sure documents are safe and all that. It's very basic at this point, but it will store the document for you, and then if I call reindex, all the documents on the system I'm calling reindex on get sent right back into the search engine, untouched. Another API call I want to add is a merge command, where I can say: merge these attributes with the document you already have for me, so you don't have to re-enrich it; just take my product star rating and update that field on the document. That should be pretty easy; it's not a hard thing to do. The documents are stored on disk, there's no special database, and that probably needs to be a bit more robust. This also works with Quepid, and I want to give a shout-out to the folks at AMBOSS. I was actually right in the middle of writing a Quepid pass-through query for Elastic; I had it done for Solr,
where I could talk to Solr from Splainer through Hello NLP, and talk to Solr through Hello NLP from Quepid, but I was just starting on Elastic when the folks at AMBOSS released their quepid-es-proxy code, which is on GitHub. I took some inspiration from them; it was an MIT license, and I left the copyright in there because they deserve full credit. I've used some of their API calls here and they work pretty well, so now you can use Quepid with this. If you don't know Quepid, it's a relevance judgment tool that we develop at OSC, and it's open source. So that was one of the original goals again: make sure the tools we already have work with this stuff. That's been almost an hour; I do have a couple more things to say, but that's the end of the demo. Again, you can download this thing. Let me go back into presentation mode. Thank you, demo gods, you were kind to me. So, conclusions: use NLP with search. Do that. If you're brand new to NLP and want to get started on your own, there's a free spaCy course from the Explosion AI team. Ines, who is an amazing contributor to the community, wrote the course; she works at Explosion AI, and it's really, really good. If you're just starting off with spaCy, go take it; it's very helpful in the way it teaches you the spaCy API. It doesn't teach real-world use cases, though; there are some good examples, but the use cases that you and I deal with day-to-day as search practitioners aren't really covered. For that we have the natural language search training. If you're interested in NLP and using NLP with search, I talk about a lot of the concepts there. I talk very heavily about measurement and how to understand whether NLP is working for you or not, because these things are models, and they're complicated, and they're not foolproof, and you have to make sure you're doing stuff right,
because otherwise your accuracy is going to be off. I also get into the whole transformer story: we talk about vector search, and as the final lab we actually write, from scratch, an NLP and vector-based search engine in pure Python, which is a very illuminating experience, because then you really learn the pitfalls of this stuff and how to wield it in the ways you need. If you're interested in this project, please visit the Hello NLP source and take a look. The glue is still drying, it's still in the clamps, so don't put it into production yet; if you've got a million visitors per day, don't stick this thing in front of them right now, but it's definitely usable for experimentation. The API isn't 100% there, so I've labeled it 0.9; I want to do a couple more things before I get to 1.0, so it's labeled beta right now. That's where we are with the project, and now I'll open it up for questions. Thank you Max, that's fantastic. We've got a couple of questions in the chat. Alex, do you want to ask your question? Yeah, sure. I was just trying to understand whether this pipeline is index-only, or index and query; it seems both. It's both, and the pipeline is just JSON, so you can create any analyzers you want in Python, reference them in the pipeline, and use them at query time or at index time. If I go back to the demo... okay. If I can just clarify: how much could I get out of it being just index-time? Because I could index offline and not worry about it being 0.9, but then point to a production Solr. How much am I gaining from just index time? Oh yeah, I would recommend that at the very least you lemmatize the query.
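A toy illustration of what query-time lemmatization buys you; the lemma map here is a tiny hand-rolled stand-in for a real lemmatizer like spaCy's or OpenNLP's, just to show where the step sits:

```python
# Hand-rolled lemma map standing in for a real lemmatizer; the point
# is only that the query is normalized to the same lemmas that were
# enriched into the index at index time.
LEMMAS = {"running": "run", "shoes": "shoe",
          "jumped": "jump", "stripes": "stripe"}

def lemmatize_query(q):
    """Lowercase the query and map each term to its lemma."""
    return " ".join(LEMMAS.get(t, t) for t in q.lower().split())
```

If the index-time pipeline wrote lemmas into an enriched field, running the query through the same mapping is what makes "running shoes" match documents enriched with "run shoe."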
But you don't have to use this to do that. Lemmatization is a big part of fixing a lot of the problems we have in search; in theory you'll get lower recall, but you're not going to get nearly as much noise. So my recommendation is: sure, you can use it, and you can do the payloading and the entity extraction and all those things, and just point your search directly at Solr; you don't have to go through Hello NLP. But if you want matches on the lemmatized terms you've enriched, then you'd probably want to use either this, or lemmatize on your own using spaCy or some other tool. If you're on another stack and not Python, say Java, you could use OpenNLP and lemmatize there; it's probably very close, but test that first before you do it for real. So yes, you could use this purely as content enrichment if you wanted to, and you'd get a lot of value out of just the payloader and the entity extraction. You could also write all kinds of other language things, or write a classifier in Python, or do whatever you want. You could say: I've got these 18 content types and a million documents, and I want to classify each document against a content type. I don't want to do that in Lucene or Solr or Elastic, so you could stick Gensim in here or something and write a classifier; that's very straightforward to do. Wow, somebody's drawing... cool, didn't know we allowed that. Thanks for your question, Alex. David, do you want to re-ask your question? Sure, thanks Max, this was really interesting. One question I had: at the beginning of your pipeline you have the html_strip step, and I was wondering whether that is extensible to XML in general, so that instead of just looking for HTML tags like bold, if I had XML tags that
are important to me, I could weight those. Yeah, absolutely. The short answer is yes: this uses lxml, which supports XML, not just HTML, out of the box. There are also some other things I tried, like Beautiful Soup 4 and the native Python HTML parser, but lxml will definitely parse the XML for you. I haven't messed around with using a schema, like an XSD, with this yet, but again, it's just Python: you could write an analyzer to do whatever you want, and as long as you're outputting text or spaCy tokens or lists or dicts or whatever, and the input of the next stage in the pipeline matches the output, you can chain them together. It's not like Lucene, where you have a TokenStream and you always have to conform to it: you could go hog wild, duck type, go crazy, and have all kinds of objects that connect to each other. It's just Python, and then you're only bound to making sure it runs in the order you specified. So you could very easily parse XML using your own schema, do whatever you want with that data, and send it on down; eventually you'll have to output text at some point to hand over to your search engine. Fantastic, does that answer your question, David? Yes, thank you very much. Wonderful. We've got no more questions in the chat, but if anyone else has a question for Max, feel free to unmute, put your video on, and ask. It's Alex again, if I can: I was just curious about a comparison to Querqy (I don't even know how it's spelled), because they did it as a search component, inline, and so on. It seems semi-complementary, semi-competitive? Oh, it's absolutely not competitive, and I should
have mentioned Querqy before. René Kriegler, the maintainer of Querqy, has written an amazing tool, and this actually really complements Querqy, because you could do a lot of the NLP pre-processing to line things up, and then manage rule sets in Querqy, which is really easy to do, either with rule sets you write manually or with the SMUI tool. I actually thought about throwing in an example of that, but I didn't have time before this talk. So yes, you could definitely pre-process your query in such a way that Querqy can pick it up and do a lot of other cool stuff, because Querqy already works really well with the eDisMax parser, and the dis_max parser in Elastic, I think. Fantastic. Hasan has a question: is it possible to use Hello NLP with different NLP packages instead of spaCy? As long as it's Python, sure, you can do whatever you want. I didn't show any examples, but you could spin up Hugging Face Transformers; one thing I want to get to eventually is doing a vector search with Elastic using its vector fields. I didn't have time to get there, but as long as you have something written in Python and you want to use it, go right ahead. The tokenization is written with spaCy; if you don't want to tokenize with spaCy, you could write your own tokenizer using NLTK or something else. It's totally up to you. And Vincent asks: does the html_strip keep meta or other highly significant HTML tags? I guess that depends on the library you're using. The way it's written now, it doesn't; it throws all the tags out. But you could very easily write a scraper, or whatever you want to call it, to look at an HTML document and say: this title element I'll use as my document title and send it over here, and this is a meta description, so I'll use it as my summary field, or
what have you; that should be pretty straightforward. Great, and Alex has another question: what about noun phrases instead of just head nouns? So, skipchunk actually does noun phrases, and you can chunk in spaCy too, though I'm not crazy about spaCy's chunker because it includes determiners: if you have "the Haystack conference" it keeps the "the," and I think that should be chucked out. But the skipchunk tool that I wrote, which is just a simple finite state machine, will chunk nouns together, and then you can reference those as heads. So you don't have to go through the complexity of that thing: you could do a chunking pre-step, re-tokenize in spaCy, and then use the compounds as your nouns and weight those together. Fantastic, do we have any more questions? Oh, Alex is back; you're getting the points today, Alex. How well does this deal with technical texts, like the Solr Reference Guide? I wonder why you're asking that question, Alex. Here's where I get into my big caveat: these things are models. I didn't train the model; it's trained on OntoNotes 5, which is comprised mostly of text written for human readers, and I think also transcribed speech recordings: news, Wikipedia, that kind of stuff. So if you've got some crazy domain-specific content like the Solr documentation, or the federal code of regulations, or PubMed, with a lot of vocabulary and form that isn't in the out-of-the-box spaCy model, you're going to have trouble. That's why you can tell it what your model is: you can train your own spaCy model. The courses.spacy.io link walks through model training, so you could train your own spaCy model and just reference it in here. The one in the demo is a model you download from the web, from spaCy, from Explosion AI; if you wanted to reference your
own model, it's just a path string: just like I've got this ./plugins, you could put a ./my_fancy_model right here, and if it's a spaCy model it'll work. Fantastic. Okay, Suzanne, you're next up; we'd like to hear your question. Yes, can you hear me? Yes. Hi, so how could I adapt Hello NLP to non-English languages? Great question. I'm sorry, I only speak English; I've written this with English in mind because that's my native language. I like to say I speak two languages, English and bad English, like the guy from Die Hard. But there are other spaCy models for different languages, and a lot of community support, so depending on what your language is, you can find out the model name and drop it in here: for Spanish I think it's es, French is fr, things like that. I think French, German, Spanish, Italian, the Romance languages, and I think Russian, and maybe even Chinese and some other Asian languages, are supported. Visit explosion.ai and see what languages they support, because they do have a lot of out-of-the-box models. That being said, when I was writing this, the way I'm looking at things like the dependency parse, I don't know how dependencies are parsed in different languages; head nouns may be tagged a different way, I'm not sure. So if you're doing that, you probably want to have a look at the payloader and see how your language is structured. It's pretty easy, actually. I didn't show this, but I've got this notebook here; it isn't open source because it doesn't need to be, it's just a couple of lines of spaCy, but you can see it's pretty easy to see relationships here: I've loaded my model, I can type in text, and I can see what the parse looks like. So I would encourage you to just get started with spaCy: download your language model, play around with it, see how it's parsing, and if
it looks accurate to you, do some testing, label some data, and see if it does what you need. I do believe spaCy publishes their OntoNotes training results; I don't know exactly where, but I think they talk about accuracy depending on the language and model you're looking to use. Yeah, I have actually used the spaCy model for German already, and unfortunately the lemmatizer is not so good, but you can plug in other lemmatizers, so it's very extensible. But another problem I have in e-commerce: the product information is often only one word. The most important fields are usually category, color, and title, and they're maybe one, two, or three words, so I cannot do part-of-speech tagging, and then I can't do lemmatizing and all this. Yeah, in the content you probably can't, but when somebody queries, that's a really interesting thing. If I did a search like "shirt without stripes" (I have no idea what that is in German), you may have a category like shirt, and then an attribute like style with the value stripes. So you could parse the query and route it how you want: you could do filtering and boosting and things like that. E-commerce is always a challenge with these things. I didn't show a Duckling example, but Duckling is a really powerful value and entity parser that recognizes dates, currencies, measurements, a lot of the things people use when they're doing e-commerce searches: I want to search for a price, or for a certain size of something. So Duckling is a nice tool where you can throw a bunch of text at it, get back the named entities for the values that exist in it, and then parse those further.
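The query-routing idea can be sketched with a regex standing in for the kind of typed-value extraction Duckling does; the price field name and filter syntax here are hypothetical:

```python
import re

# Regex stand-in for Duckling-style typed-value extraction: pull a
# price ceiling out of the query, route it to a range filter, and
# leave the rest as the keyword query. Field names are illustrative.
PRICE = re.compile(r"(?:under|below)\s+\$?(\d+)")

def route(query):
    """Split a raw query into a keyword part and structured filters."""
    params = {"q": query, "fq": []}
    m = PRICE.search(query)
    if m:
        params["fq"].append(f"price:[* TO {m.group(1)}]")
        params["q"] = PRICE.sub("", query).strip()
    return params
```

A real integration would call out to Duckling for dates, currencies, and measurements and route each extracted value the same way.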
The way you would integrate that with Hello NLP is to use the requests module to make a service call out to Duckling, which is its own service; you could host it on the same server in another container, or use aiohttp to make sure your async calls stay performant. Then you go out and get that data and use it further in your analysis chain. I did want to write a Duckling example, but I didn't have time before this talk. But yeah, there are a lot of ways you can use this beyond just parsing the text in your content: you can use it for parsing queries, looking deep into them to find the intent, the attributes, and the modifiers, and routing the query in an appropriate way, which is very often a challenge with Solr and Elastic out of the box. Thank you, Suzanne. Hasan had another question: is it possible to use payloads with Elasticsearch? It is; I was right in the middle of writing my payload query when we started this talk. Where is it... vector something something... I must have closed it. Let me see. While Max looks, I'll just say that he has promised more on this library in the form of blogs; if you're not already following the Open Source Connections blog, I'm sure this won't be the last time we hear about all this, so the things Max couldn't fit into his talk should appear in future blogs or even future talks. So: this is an almost-working Painless script to grab the payloads. Josh Devins is on the call, he should know, but yes, payload scoring is possible; I haven't got there yet, but send Josh a note on Slack to ask how payload boosting in Elastic works. Sorry, Josh. I don't know off the top of my head either, but yeah, ping me on Slack and I'll look it up, or find someone who knows. If you're not already on Relevance Slack, do jump on, where you can ask us questions about this stuff, and people far cleverer than us will also be able to
answer; the link is in the chat. So we have a last question here: could you please tell us about the status of vector search and its integration with Solr and Elasticsearch? So, vector search is supported in Elastic, and it's kind of hit-and-miss with Solr. There's a great blog post, linked in the vectors-and-search channel, comparing Vespa and Elastic on vector-based scoring and boosting, and there's an Elastic demo out there somewhere. I think Mayya is on the call; Mayya wrote a lot of the Elasticsearch vector stuff, a year and a half ago I think. So I think the status of vector search in Elastic is okay; I don't think it's amazing. The problem is you're trying to store vectors in Lucene data structures, and that's hard for various reasons; you also lack GPUs and the other things that make this much faster, and Java isn't the best at the performance-critical parts. Solr, I think, is doing not nearly as well in this area. There are a bunch of efforts out there, both at the Lucene layer, started by some Solr committers, and Trey Grainger had an example using streaming expressions, but those examples are not for the faint of heart: you have to spend a bunch of time building plugins, sticking stuff in there, and getting things to work in a custom manner. Personally, I wouldn't go down the road of using a Lucene-backed engine for vector search yet. I would probably use another tool that supports approximate nearest neighbour search as a supplementary database to your search engine, do things there, get the scores back, and then merge the scores somehow: maybe send the score in as a parameter in your query, so do the ANN first, get the score back against your top thousand documents or whatever, and then use that score if you see it in the recall for the
There aren't amazing patterns for this yet. That's a great question, and we could talk about it for hours, but one of the other reasons I started this project is that I was seeing a whole bunch of little search engines popping up that tried to solve this problem with vector search. There's Jina AI, and, interestingly, a new search engine called Haystack — which is kind of funny. Those are open source tools, and there are a couple of other new search engines out there; there's also Anserini and some other things. But the reason I did this is that nobody is going to migrate their search engine just to get vector search — unless maybe you're migrating to Vespa, which I think does the best job of all of them on this right now, because they've been doing it for a long time. Nobody is going to drop their big Elasticsearch or Solr cluster and switch to Jina AI right now; there's way too much business backing that stuff. You don't want to replace your search engine — it's like replacing your database. I want to treat search like a really, really fancy database, but you also want a lot of other nice capabilities that currently aren't available in a packaged manner. That's why I wrote this library: I wanted the folks running Elasticsearch and Solr to have middleware they can use, so they don't have to switch to another search engine to get NLP capabilities. But going back to your original question: I don't think vector search is solved just yet. You can try it, though — and if you do, please share your results; we'd love to see them.
Fantastic. So Max has just popped up our closing slide here. Do join Relevance Slack if you're not there already. The video of this talk will be up on YouTube as soon as we've processed it — there's lots more there, including all the talks from our previous Haystack conferences that were videoed. Join the meetup, and please do spread the word. And do keep using Max's tools — I'm sure he'd appreciate pull requests and feedback: file issues, try it out, write some stuff up. We've got more events coming up. The next Haystack LIVE! talk will be October 15th, with James Rubenstein of LexisNexis talking about building a data-driven search program, which I think is going to be fascinating — if you haven't seen James's posts on Medium recently, do check them out; they're fantastic, really about the process of relevance tuning. And then we have our trainings, of course: the Solr training next week (get in quick), Learning to Rank on the 20th of October, and Natural Language Search on the 17th of November, taught by someone you may recognize — yes, it'll be Max, explaining the mysteries of natural language search to you. So do check that out. I'll also give you very early notice that — nothing's confirmed yet — there may be some kind of Haystack event before the end of the year; that's all I'm going to say, and I'll obviously give you all notice as soon as I possibly can. We're hoping to keep the community going and get people meeting and talking about search as much as we can. So thank you very much to everyone for coming, thank you Max for being our speaker today, and we'll see you all next time, hopefully on October 15th. Thank you so much.
Info
Channel: OpenSource Connections
Views: 1,905
Id: vSspoJ_VkMg
Length: 78min 43sec (4723 seconds)
Published: Thu Oct 01 2020