Beginner's Crash Course to Elastic Stack - Part 2: Relevance of a search

Captions
All right, thanks for your patience, everyone. So yeah, we're here, and Lisa is going to be presenting the second part of our Beginner's Crash Course to the Elastic Stack. If you want to go to the next slide: one thing I wanted to mention, and again I will drop these links into the chat channel, is that we have a variety of avenues for you to get connected with the Elastic community. One of them is through our meetups, which you are a part of right now, and you can find the latest meetups that we are holding all across the United States at our AMER virtual user group, so there's a QR code and link there. You can also join our Elastic community Slack workspace, which has channels on every topic under the sun: machine learning, data science, observability, security, you name it. There's also a job-seeking channel, so if you're already an engineer who's familiar with Elastic and you're looking for a job, there are opportunities there, so definitely check that out. Our community YouTube channel has many videos if you're looking to dive deeper into a particular subject; we have playlists based on our solutions, and the link and QR code are there as well.

We also have, if you want to go to the next slide, Lisa, our contributor program. We always invite folks to contribute to the Elastic community, either via blog posts or tutorials, or via code submissions; maybe you found a bug and you want to help fix it. We recognize those folks with a program where you earn points for your contributions, and then we give you an award at the end of the cycle. Learn more about our contributor program by following that link and QR code. And then I think, was that it, Lisa?

I believe so.

Okay, awesome, carry on.

Okay, thanks, Phoebe. Hi everybody, welcome to the Beginner's Crash Course to the Elastic Stack. I'm Lisa, and I'm a developer advocate at Elastic. So last month we kicked off the
Beginner's Crash Course series. During the first workshop we talked about the exciting use cases of the Elastic Stack, then we dove straight into the architecture of Elasticsearch, and then we performed CRUD operations with Elasticsearch and Kibana. Today's workshop builds on the content of the first workshop. Now, if you missed the first one, not a big deal. On your screen you have a link to a GitHub repo, and this repo contains a video of the first workshop as well as two blogs that go over the same concepts. So if some of this stuff seems unfamiliar to you, refer to this repo and get caught up afterwards. Phoebe will post this link in the chat as we speak.

Okay, so before we get started, let's do a quick recap. If you're a developer working with data, the Elastic Stack is a great tool to have on your belt. It consists of four products: Beats, Logstash, Elasticsearch, and Kibana. With the stack you can take data from any source in any format, then search, analyze, and visualize it in real time. Today we'll focus on Elasticsearch, which is the heart of the Elastic Stack. It is a search and analytics engine that powers a lot of the apps that you use today. For example, if you've ever searched for a restaurant on Yelp or searched for groceries on Instacart, Elasticsearch is the engine that is powering that search.

Search is an experience, and whether you're searching for documentation or your favorite show, you expect fast and relevant results no matter the scale. In the first workshop we talked about how Elasticsearch allows you to get fast search results at scale, and today we'll talk about the relevance aspect of search. We search for things on a daily basis, and as developers, search is our lifeline, right? Whether we're fixing a bug or building a certain feature, we go straight to the search bar and hope that somebody has already figured it out and shared it online. Now, it's really frustrating when you're searching for an answer and you're not quite getting
what you're looking for, and that is what relevance is all about. When you search for something on your app, you want results that are directly related to what you're searching for, which brings us to the question: how do we measure the relevance of our search results? Two factors that we'll focus on are precision and recall, so let's delve into these a little bit more.

As a quick review, we know that Elasticsearch is a search engine that allows us to store, search, and analyze data. It stores data as documents, and documents that share similar traits are grouped into an index. When you search for something, Elasticsearch retrieves relevant documents, then presents them as search results, which is highlighted in orange here. On this slide we have two diagrams depicting the same thing. On the left we have documents grouped into an index, and the same thing is shown in the diagram on the right: the yellow rectangle represents your sample index, and the gray dots are the documents contained in that index. I'm going to use the diagram on the right to explain precision and recall, but before we do that, let's go over some terms real quick.

When you send a search query to Elasticsearch, it retrieves documents that it considers relevant to the query. These are the dots inside the white circle, and these are the documents that Elasticsearch sends back as a response. Some of these retrieved documents are what you expect to see in your response, and these are known as true positives. Now, you've probably had an experience where you searched for something and some of the results were not relevant to what you were looking for. These are known as false positives: irrelevant search results that were retrieved by the search engine for some reason.

Now let's focus on the dots in the yellow region of this diagram. These are documents that were not returned by the search engine. Some of them are truly irrelevant to the search
query and were correctly rejected by the search engine, and these are known as true negatives. Among the rejected documents there may be relevant documents that should have been returned in the response, and these are known as false negatives.

Earlier I mentioned that precision and recall are used to measure the relevance of a search engine. The term precision has to do with the dots inside the white circle, the documents that are returned as search results. Precision is calculated as true positives divided by the sum of true positives and false positives. What precision tells you is what portion of the retrieved data is actually relevant to the search query. Recall, on the other hand, is calculated as true positives divided by the sum of true positives and false negatives. What recall tells you is what portion of the relevant data is being returned as search results.

Precision and recall are inversely related. Precision wants all the retrieved results to be a perfect match to the query, even if it means returning fewer or no documents, whereas recall focuses more on quantity: it wants to retrieve more results even if the documents may not be a perfect match to the query. The dilemma here is that we want to present really relevant items, but we also want to retrieve as many results as possible. As you can see, these two factors are at odds with each other, because if you want to improve precision it might cause a decline in recall, and vice versa.

Okay, so let's recap real quick. Precision and recall determine which documents are included in the search results, but they do not determine which of these returned documents are more relevant than the others. This is determined by ranking. When you look at your search results, you'll see that the most relevant results are at the top and the least relevant are at the bottom, and this ranking, or order, is determined by a scoring algorithm. Each result is given a score, and the ones with the highest
score are displayed at the top, whereas the ones with the lowest score are displayed at the bottom. A score is a value that represents how relevant a document is to that specific query, and a score is computed for each document that is a hit. Hits are the search results that are sent to the user. The higher the score a document has, the more relevant the document is to the query, and the higher it's going to end up in the order. There are multiple factors that are used to compute a document's score, and for this workshop we'll only focus on term frequency and inverse document frequency. So let's break this down.

When you search for something, you type a search query in the search box, and Elasticsearch looks at the query and pulls up relevant documents, or hits. Then it calculates a score for each document and ranks them by relevance. So how does this happen? Let's talk about how term frequency plays a role in calculating a score. Here we have a search query, "how to form good habits", and this query is made up of multiple search terms. Elasticsearch will look through the returned documents and calculate how many times each search term appears in a document. This is known as term frequency. If a document mentions the search terms more frequently, Elasticsearch assumes that this document is more relevant to the search query and assigns a higher score to that document. Let's say we're looking at the frequency of the term "habits". In the description field of the first document, the term "habits" appears four times. In the same field in the second document, the term "habits" appears one time. So in this example the first document will be given a higher score and end up higher in the search results.

Now, when we calculate a score based on term frequency alone, this will not give us the most relevant documents. This happens because term frequency considers all search terms to be equally important when assessing the relevance of a document. So let's look at the search query.
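Before going through the individual search terms, here is a minimal Python sketch of the term frequency idea described above. It only counts whitespace-separated tokens; the two description strings are made up for illustration, the helper name `term_frequency` is hypothetical, and Elasticsearch's real scoring involves more than this raw count.

```python
from collections import Counter


def term_frequency(text: str, term: str) -> int:
    """Count how often `term` occurs in `text` (case-insensitive, whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(tokens)[term.lower()]


# Made-up "description" fields, for illustration only.
doc_1 = "good habits beat bad habits and small habits build lasting habits"
doc_2 = "a short note on habits"

query_term = "habits"
tf_1 = term_frequency(doc_1, query_term)
tf_2 = term_frequency(doc_2, query_term)

# With term frequency alone, the document with the larger count ranks first.
ranked = sorted([("doc_1", tf_1), ("doc_2", tf_2)], key=lambda pair: pair[1], reverse=True)
print(ranked)
```

In this toy setup the first description mentions "habits" four times and the second once, so the first document ranks higher, which mirrors the slide example.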
Here we have the search terms "how", "to", "form", "good", and "habits". Not all of these search terms will help you determine the relevance of a document. For example, the first four search terms are commonly seen in many, if not all, documents. If you look at the hits, at the documents highlighted in orange, documents like "How to form a meetup group" or "Good chicken recipes" do contain some of the search terms, but they are completely irrelevant to what we're looking for. Because of term frequency, if these commonly found search terms appear with high frequency in any of these documents, those documents are going to end up with higher scores even though they are irrelevant to the query. Elasticsearch offsets this with inverse document frequency. If certain search terms are found in many documents in the result set, Elasticsearch knows that these terms are not useful for determining relevance. So when it goes through all the hits, it will reduce the score for documents with unimportant search terms and increase the score for documents with important search terms like "habits".

So we just covered the basics of relevance. Next we're going to fine-tune the precision or recall of our search results, and we're going to do that by sending queries from Kibana to Elasticsearch. In the first workshop I went over how to download Elasticsearch and Kibana and run these locally on your own machine. This option is free and there's no expiration date, but I found that when I run these locally and screen share at the same time, Zoom slows everything down, so a live demo is almost impossible. When I run these on Elastic Cloud, it seems to do better while screen sharing. So this time I'll show you how to get set up on Elastic Cloud. I'm going to breeze over this setup; I don't expect you to set this up during the presentation. I just want to show you what you need to do, so if you want to try this on your own you can watch the
recording and get started.

Okay, so here we have a link to the GitHub repo for today's workshop. Phoebe, will you drop the link in the chat? I think you did a little bit earlier.

I did, but I'll do it again just in case.

Perfect, thank you. Okay, so go to this link and have it pulled up. You'll see this on your screen. This repo contains all the resources that are shared during this workshop, and the slides and the recording of this workshop will be included here as well. If you scroll down to the resources section, you'll see the link to the free Elastic Cloud trial. Right click on that and open the link in a new tab. The link will take you to the free trial page. Elastic Cloud hosts Elasticsearch and Kibana as a service, and unlike the downloaded option it does all the heavy lifting of managing the stack. You usually get a free trial for 14 days, but the link in the repo will give you access for 30 days. There's no credit card required and the trial will expire on its own.

To get going, enter your email and click on Start free trial. Once you do that, it'll ask you to go to the email account you signed up with and verify your email. So go to your inbox and click on the email from Elastic, then click on the Verify and Accept button, and it will prompt you to set your password. Once you set your password and log in, it'll take you to this page, where you'll click on Start free trial. Next you'll select the Elastic Stack. When you do that you'll see this drop-down menu where you can configure your settings. If you look under Select hardware profile, Elasticsearch offers several deployment templates for different use cases and workloads, and each template selects the appropriate cloud hardware configuration for different needs. If you're just getting started or don't quite know your needs yet, go with the recommended I/O Optimized option. Then you choose a cloud provider of your choice. Let's say you have an app and you want to
integrate Elasticsearch. If your app is running on Google Cloud, you don't want Elasticsearch running in a different cloud provider, because that'll cause latency issues. But for what we're about to do it doesn't matter which one you choose, so just select one. Then select the region close to you and select the latest version of the Elastic Stack, which at the moment is 7.10.1. If you scroll down, you can name your deployment whatever you want; I named mine Beginner's Crash Course. Then click on Create deployment. Once you do that you'll get your deployment credentials, and you'll need these when you add data to Kibana. So either download the credentials or save them somewhere else, as these are only shown once. When you click on Download or continue without downloading, it'll create your deployment and load Kibana. Once everything is loaded, click on Open Kibana, and after that both Elasticsearch and Kibana are ready to go. When you advance to the next page, click on the Explore on my own option, and it'll take you to this home page.

In order for us to explore the relevance of our search, we need to put some data into Elasticsearch so we have something to search for. If you have your own data in a CSV, NDJSON, or log file, it's really simple to get data into Elasticsearch. Scroll down on the home page and click on the Upload a file option. For our tutorial I'll be using the News Category Dataset from Kaggle, and this data set contains news headlines from HuffPost from the year 2012 through 2018.
Now, I included a link to the data set in the GitHub repo, so if you want to try it out on your own later you know where to find it. Just make sure you download and unzip the data first, then drag and drop the data set here. Once you do that, Kibana will give you an analysis of the first thousand lines of your data and a summary of your data set. If you look at the file contents, you'll see several documents of news headlines under different categories like crime or entertainment. Under the summary we also see that our data is in NDJSON format and the time field is date. When you scroll down you'll see the fields section: in this data set each document contains fields such as authors, categories, headlines, and so on, and we also see some high-level statistics for each one. Scroll down and click on the Import button, and this will import your data into Elasticsearch. When we push data into Elasticsearch, it groups all documents into an index so we can easily find them. So give your index a name; I named mine news headlines, all lowercase. Then click Import, and Elastic Cloud takes care of the rest.

So you have data in Elasticsearch and you also have Kibana up and running. Kibana is a UI used to visualize and explore data in Elasticsearch, and for our tutorial we'll use the Kibana console to search for data and improve the precision or recall of our search results. Click on the menu icon in the upper left corner, and in the drop-down menu scroll all the way down to the Management section and click on Dev Tools. This will pull up the Kibana console. Click on Dismiss and delete the default query here, and now we're finally ready to get started.

All right, let me get organized here real quick. Okay, so I have two windows open side by side: on the left I have the Kibana console and on the right I have our workshop repo. We have two goals: one is to search for information, and two is to fine-tune the precision or recall of our search results, and
we'll be using Elasticsearch and Kibana to get these done. We just got data into Elasticsearch, so all we have to do is use Kibana to send search requests to Elasticsearch. Let's start by exploring our data first. When we added data to Elasticsearch, it stored our data in an index called news headlines. We want to get a feel for our data, so knowing what a document actually looks like or how many documents we have would be really helpful. There's a query that gets you that information. Go to your repo and scroll down to Search for information, then down to Retrieve all documents from an index. For every request we'll cover, I included the general syntax for you so you can customize it for your own use case, but for our tutorial we'll use the request shown in the example. What we're saying here is: get search results from the index news headlines. We'll copy and paste that into our console. Make sure to click and select it so there's a dark grey bar over it, and click on this green arrow to send the request.

What this does is give you information on all documents in the news headlines index. In the right panel you'll see the search results from Elasticsearch, and if you scroll down to line 16, it gives you a sample of 10 search results by default. Let's scroll through and see what documents we have here. Anything interesting? All right, let's pick this one. This document is from the index news headlines, and if you look under source you'll see all the fields, or content, that this document contains. It shows the date when the article was published, the link to the article, category, headline, and so on. So with the search request you get a general idea of what a document looks like, so we know what we're dealing with.

Let's see how many documents we have. Go back to line 10. It tells you the total value of hits is 10,000. By default Elasticsearch limits the total count to ten thousand, and this is done to improve the response speed
on large data sets. To see if ten thousand is the exact total number of hits, check the relation field below. If you see eq in this field, that means the value is equal to the total number of hits, but we see gte, which means that our number of hits may be greater than or equal to 10,000. Let's say we want to know the exact total number of hits. To do that, look at your repo and scroll down to Get exact total number of hits and down to the example. This is almost identical to the request that we just sent. We're saying: hey, I want to get search results from news headlines, and by the way, I want the exact total number of hits in the response. Let's copy and paste that, make sure to select it, and send. Depending on how big the data set is, this query may have a slower response time. If you look at line 10, hits, you'll see that the value is now 200,853, and if you look at relation you see eq, which means that this is the exact total number of hits.

So we have over 200,000 article headlines and we want to find interesting patterns in our data, but it's hard to know where to even start. One way to narrow it down is to search for data within a specific time range that you're interested in. Let's scroll down to Search for data within a specific time range and down to the example. There are two main ways to search in Elasticsearch, and these are queries and aggregations. Queries tell Elasticsearch to retrieve documents that match the criteria, and right now we're searching for documents that match our time range criteria, so for our use case we need to send a query. This is the query that we're going to send. We start with: get search results from news headlines. I'm querying data from a certain time period, so the type of query that I'm sending is range, and the following are the criteria for documents in my search results: I want you to only look at the date field and pull up all the articles that have been published between these two dates. So let's
copy and paste that into our console, make sure to select it, and send. You'll see that we got over 8,000 hits, and if we were to look through all these documents and look at the date field (where is that?), you'll see that all of the documents were published within the time range that we specified.

All right, so how can we narrow down our search? When you look at the documents in our search results, we see that these articles belong to different categories. If you look, this one belongs to media, and the other one, let's see, belongs to the parents category. To explore our data further, it'd be really helpful if we knew what types of news categories exist in our data set. So let's scroll down to the Aggregations section. Remember, there are two main ways to search in Elasticsearch, and these are queries and aggregations. Queries are used to retrieve documents that meet the criteria, but in this case we're not interested in grabbing documents. What we want to know is the types of news categories that exist in our data set. To get this information we need to analyze the data and get a summary of the categories that exist in our data, and this type of search is known as an aggregation. So this time we'll send an aggregations request.

Turn to your repo and scroll down to the example. The aggregations request is pretty similar to the query request that we just sent. We're saying: hey, I want to get search results from news headlines. I'm sending an aggregations request, and I want you to name this report by category. You're going to run an analysis on the following terms, and that term is the field category, and bring me up to 100 categories if you've got them. Let's copy and paste that into the console, make sure to select it, and send. Now go to line 10 and click on this downward arrow to minimize it. Then you'll access the aggregations report that we named by category, and if you look at the field buckets, you'll see an array of all the categories that exist in
our data set. It seems like we've got politics here, wellness, entertainment, and travel, and each category has a document count, which tells you how many articles have been written for that specific category. So now we have a lot more to work with.

Let's scroll down to Combination of query and aggregation request and down to the examples. It seems like the entertainment category contains a lot of articles, and I want to explore that a little bit more. What if I wanted to identify the most popular topics in the entertainment category? This is a combo of both query and aggregations requests, because first you're going to pull all the documents from the entertainment category, so you've got to query the data first. Then you have to analyze the queried data and get a summary of the most significant topics in the entertainment category. So let's go down to the example here. What we're saying is: get search results from the news headlines index. First, I'm sending a query request: bring me all the documents that match the following criteria, and the documents will be from the category entertainment. Second, I want you to run aggregations on the documents that we just queried. I want this report to be named popular in entertainment, and I want you to run an analysis on the significant text that is found in the field headline. Let's copy and paste this into our console here, make sure to select it, and send.

Now go to line 10, minimize it, and you'll get access to the aggregations report that is named popular in entertainment. Under buckets you'll see an array of all the popular terms in our entertainment category. It seems like the word "trailer" is pretty significant; it's been found in 387 headlines. It seems that "movie" is one, "Taylor" is one, and "Kardashian" is one. One of my guilty pleasures is watching Keeping Up with the Kardashians, so I'm going to look up articles that mention Khloe Kardashian and Kendall Jenner. Scroll down to Precision and recall and down to the examples here. Okay.
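For readers following along without the repo open, the request described next can be sketched as a Python dict and printed in roughly the shape the Kibana console expects. This is a reconstruction from the narration, not a copy of the workshop repo: the index name `news_headlines` (with an underscore) and the exact query string are assumptions, and `to_console` is a hypothetical helper.

```python
import json

# Assumption: the index created in the workshop is spelled "news_headlines".
INDEX = "news_headlines"

# Assumption: the four search terms mentioned in the narration, as one query string.
SEARCH_TERMS = "Khloe Kardashian Kendall Jenner"

# A match query on the headline field. With no operator given,
# Elasticsearch combines the analyzed terms with OR.
match_request = {
    "query": {
        "match": {
            "headline": {
                "query": SEARCH_TERMS
            }
        }
    }
}


def to_console(index: str, body: dict) -> str:
    """Render a search request as text that can be pasted into the Kibana console."""
    return f"GET {index}/_search\n{json.dumps(body, indent=2)}"


print(to_console(INDEX, match_request))
```

Writing the body as a dict keeps the nesting explicit: the field name sits under `match`, and the search terms sit under that field's `query` key.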
So I'm going to send this query here. What this is saying is: get search results from the news headlines index. I want you to query all the data that matches the following criteria: bring me all documents that contain these search terms in the headline field. Let's copy and paste that into our console, select it, and send. You'll see that there's a total of 926 hits, so let's go over the search results. This is the first document here, and I'm going to look at the headline. It seems like all four search terms have been found in the headline, so that's good. Now let's look at the second one. All right, we have Kendall Jenner but no Khloe Kardashian. Let's look further down. How about this one? This one has Kendall Jenner, but instead of Khloe we have Kourtney Kardashian, so kind of relevant but not really. The next one is talking about Nick Jonas and Kendall Jenner. So our search returned a lot of articles that mention some of these search terms, but these are not perfect matches to the query.

Why is that happening? Notice that our search query contains four search terms: Khloe, Kardashian, Kendall, and Jenner. By default the match query uses OR logic, so it reads as Khloe or Kardashian or Kendall or Jenner. With this query a document is considered a hit if it contains even one of these search terms in the headline. Earlier we went over the concepts of precision and recall, so I've got a question for all of you: is this query better for improving precision or recall? I want you to type the answer in the chat box. Okay, one answer says recall. Anybody else? Recall, okay. All right, so you guys have been paying attention. The correct answer is recall. By default our match query uses OR logic, so if a document has at least one of these search terms it's going to be a hit, and because of that we have high recall: we get an increased number of loosely related hits coming our way. Now what if we wanted to increase precision instead? So let's scroll
down to Increasing precision and down to the example. We can increase precision by adding an and operator. We're using almost the identical query that we just sent, except that we added an operator parameter set to and. What this is saying is: only pull up documents that contain all four of these search terms in the headline. Let's copy and paste this into our console (oops, one second, all right, there you go) and send. With the previous query we got 926 results, but with this query we got one. Once you scroll down to this one document and look at the headline, you'll see that all four search terms are in the headline, and this is a perfect match. So we definitely improved our precision. But it feels like with our first query we widened our net too much and got a lot of loosely related search results, while with our second query we were way too strict and got one match. Is there a way to land somewhere in between? The answer is yes: you can use the minimum should match parameter, so scroll down to that section.

This parameter allows you to specify the minimum number of terms a document should have to be included in the search results, and this gives you more control over fine-tuning precision and recall. Let's scroll down to the example. This query is almost identical to the query that we just sent; we just replaced the operator parameter with minimum should match and set it equal to three. What we're saying here is that for all the hits, at least three search terms must be included in the headline. Of course you could change this number as well. We're going to copy and paste this request, make sure to select it, and send. So we got six results this time, so compared to our last query we improved the recall. And if you look at the documents, let's see here: the first one has all four search terms, great. The second one has got one, two, three, okay. And the third one has one, two, three, four. So we have at least
two articles that are precise matches to the query, so we improved our precision as well. The minimum should match parameter is a great way to narrow the net without being too strict, so this is a good parameter to use when you're fine-tuning precision and recall.

Those are all the queries that I've got for you. We went over a lot, so let me see if you guys asked any questions in the Q&A section. Christian asks: can you combine rank or score with sort? This is totally my bad; I should have asked a follow-up question to understand your question better. Documents are sorted by score by default, so I interpreted your question as: if you have several different queries to sort the results, can you combine them into one request? If that was your question, the answer is yes. This actually has to do with the content of our future workshops, so I'll give you a sneak peek. I'm going to show you a bool query as an example. This query combines multiple queries into one and allows you to sort the data even further. With this query you have four clauses to choose from: must, must not, should, and filter. You can build a combination of one or more of these clauses to sort the data even further. For example, you could specify what terms a hit must or must not contain. You could include terms under should, and if these terms are included in the hits, that increases the score of those hits and lists them higher in the search results. You can also apply a filter to exclude any documents that do not match the criteria. We'll cover this more in depth in a future workshop, so I won't go further into it, but the point that I'm trying to make is that yes, you can combine multiple queries into one request to sort the data even further, and there are multiple ways to do that beyond what's been shown here. Christian, if that wasn't the question you were asking, feel free to reach out to me via email and I'll get your
question answered for you.

Pedro asked: do you know how to delete the imported data set to import a new one? The answer is yes. During the workshop, when we imported the data, we created an index called news headlines, and our data was indexed into that index. So to delete the data you can just delete the index. It's as simple as sending a delete request via the Kibana console. The syntax you're going to use is DELETE followed by the name of the index you're trying to delete. Once you send the request, if you see acknowledged: true in the response, it means that your data set was deleted. I usually double check by sending a GET request to the index I just deleted. The syntax for that is GET followed by the name of the index, then the search endpoint. Once you send the request, if you see an error in the response and it says index not found, that means that your data set has been deleted. To import a new one, you can follow the steps I showed during the presentation.

Now, I thought you might be asking in the context of replacing the old data set with a new one and using the same index name and index pattern. An index pattern is a collection of one or more indices that contain the data you want to visualize, and this feature allows you to explore multiple indices at one time. Elastic Cloud automatically creates an index pattern for you when you create an index. So if you're trying to use the same index name with a new data set, you're going to run into an error that says the index pattern already exists. In this case you'll have to delete the index pattern before importing your data with the same index name. Click on the menu icon in the upper left corner and click on Home, then on the home page click on the Manage option. You'll see a list of index patterns on your screen. Click on the index pattern that you want to delete, then click on the delete button. Then you should be able to import a new data set and still use the
same index name and index pattern.
Now, Stephen asked: when you did the manual import from file, does it let you adjust the mapping? Yes. For those of you who are not familiar with mapping, it's a schema definition that contains the names and data types of the fields of an index. So let's go back to when we were uploading the data set during the presentation. When you drag and drop your data set here, Kibana will give you the file contents and a summary of your data set. Now, there's an option called Override settings. Click on this option here, then you'll get this pop-up menu, and you can scroll down to edit the field names here and click on Apply. Once you do that, you'll click on the Import button, then it'll take you to this page, and the next steps will be identical to what we went over during the presentation. But if you want to adjust the mapping even further, you can click on the Advanced option, and here you can name your index, adjust the mapping, and import. And again, the steps after this will be identical to what we went over during the presentation.
Okay, so that was the last question in the Q&A session, so we'll start wrapping up. We have another workshop in the series coming up, and we'll be talking about full text queries and how we can customize search even more. This is happening on Wednesday, February 24th at 12 pm Central Time, and the details will be posted on our meetup page in the upcoming weeks. It's called Elastic Austin User Group, and the link to this page is included in the repo. Now, if you have questions about Elastic, the discussion forum is a great place to get your questions answered, and we have a community of developers and developer advocates that answer questions on this platform, so feel free to post your questions there. And last but not least, I often blog about Elasticsearch, so if you prefer to learn by reading instead, be sure to check out my blog. All right, that's a wrap, so thank you so much for coming and I'll see you later.
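As a rough sketch of the requests described in the Q&A, the Python below assembles a bool query body with the four clauses (must, must_not, should, filter) and prints it, together with the delete and verify requests, in the "METHOD path" form that the Kibana console accepts. It does not contact a cluster. The helper names (build_bool_query, console_request), the field names (headline, category, date), and all sample values are assumptions made for illustration and are not taken from the talk; only the clause names and the news_headlines index come from the session.

```python
import json


def build_bool_query(must=None, must_not=None, should=None, filter_=None):
    """Assemble a bool query body from the four clause types named in the talk.

    Each argument is an optional list of query clauses (plain dicts).
    Empty or missing clause lists are left out of the body.
    """
    bool_body = {}
    for name, clauses in (
        ("must", must),
        ("must_not", must_not),
        ("should", should),
        ("filter", filter_),
    ):
        if clauses:
            bool_body[name] = list(clauses)
    return {"query": {"bool": bool_body}}


def console_request(method, path, body=None):
    """Render a request as text: 'METHOD path', then an optional JSON body."""
    line = f"{method} {path}"
    if body is None:
        return line
    return line + "\n" + json.dumps(body, indent=2)


# Hypothetical field names and values, chosen only to show the shape.
search_body = build_bool_query(
    must=[{"match": {"headline": "election"}}],        # hits must contain this
    must_not=[{"match": {"category": "WEDDINGS"}}],    # hits must not contain this
    should=[{"match": {"category": "POLITICS"}}],      # raises the score if present
    filter_=[{"range": {"date": {"gte": "2016-01-01"}}}],  # excludes non-matching docs
)

# One request that combines several queries.
print(console_request("GET", "news_headlines/_search", search_body))

# Delete the index, then check that it is gone.
print(console_request("DELETE", "news_headlines"))
print(console_request("GET", "news_headlines/_search"))
```

Per the answers above, the DELETE request should come back with "acknowledged": true, and the final GET should then return an index not found error.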
Info
Channel: Official Elastic Community
Views: 20,467
Rating: 5 out of 5
Keywords: data, elasticsearch, kibana, beginners, relevance, precision, recall, search engine, queries, aggregations
Id: CCTgroOcyfM
Length: 47min 26sec (2846 seconds)
Published: Fri Jan 29 2021