Elasticsearch Tutorial for Beginners | Learn the Elastic Stack Architecture | Frank Kane

Captions
Hey internet! So if you're in the field of data science or big data, you may have heard some buzz about Elasticsearch, and you might be wondering: how does a search engine technology help me in extracting meaning from my data at scale? Well, it turns out that for some problems, Elasticsearch can give you an answer back in milliseconds when other systems like Hadoop or Apache Spark might take hours. So let's dive in together, and you'll end up with a very powerful new tool in your big data arsenal.

Let's start off with sort of a 30,000-foot view of the Elastic Stack, the components within it, and how they fit together. Elasticsearch is just one piece of this system. It started off as basically a scalable version of the Lucene open source search framework, and it just added the ability to horizontally scale Lucene indices. So we'll talk about shards of Elasticsearch, and each shard in Elasticsearch is just a single Lucene inverted index of documents, so every shard is an actual Lucene instance of its own.

However, Elasticsearch has evolved to be much more than just Lucene spread out across a cluster. It can be used for much more than full-text search now, and it can actually handle structured data and aggregate data very quickly. So it's not just for search; it can handle structured data of any type, and you'll see it's often used for things like aggregating logs. And what's really cool is that it's often a much faster solution than things like Hadoop or Spark or Flink. They're building new things into Elasticsearch all the time, like graph visualization and machine learning, that actually make Elasticsearch a competitor for things like Hadoop and Spark and Flink, only it can give you an answer in milliseconds instead of in hours. So for the right sorts of use cases, Elasticsearch can be a very powerful tool, and not just for search.

So let's zoom in and see what Elasticsearch is really about. At a low level, it's really just about handling
JSON requests. So we're not talking about pretty UIs or graphical interfaces when we're just talking about Elasticsearch itself; we're talking about a server that can process JSON requests and give you back JSON data, and it's up to you to actually do something useful with that. So for example, we're using curl here to issue a REST request with a GET verb for a given index called "tags", and we're just searching everything that's in it. You can see the results come back in JSON format here, and it's up to you to parse all this. So for example, we did get one result here, for the movie Swimming to Cambodia, which has a given user ID and a tag of "Cambodia". So if this is part of a tags index that we're searching, this is what a result might actually look like. Just to make it real, that's the sort of output you can expect from Elasticsearch itself.

But there's more to it than just Elasticsearch. There's also Kibana, which sits on top of Elasticsearch, and that's what gives you a pretty web UI. So if you're not building your own application or your own web application on top of Elasticsearch, Kibana can be used just for searching and visualizing what's in your search index graphically. It can do very complex aggregations of data, it can graph your data, it can create charts, and it's often used to do things like log analysis. So if you're familiar with things like Google Analytics, the combination of Elasticsearch and Kibana can be used as sort of a way to roll your own Google Analytics at a very large scale.

Let's zoom in and take a look at what it might look like. Here's an actual screenshot from Kibana looking at some real log data, and you can see there are multiple dashboards you can look at that are built into Kibana. This lets you visualize things like: where are the hits on my website coming from, what are the error response codes and how are they all broken down, what's my distribution of URLs, whatever you can dream up. So there
are a lot of specialized dashboards for certain kinds of data, and it kind of brings home the point that Elasticsearch is not just for searching text anymore. You can actually use it for aggregating things like Apache access logs, which is what this view in Kibana does. But you can also use Kibana for pretty much anything else you want. Later in this course we'll use it to visualize the complete works of William Shakespeare, for example, and you can see how it can be used for text data as well. It's a very flexible tool and a very powerful UI.

We also have something called Logstash and the Beats framework, and these are ways of publishing data into Elasticsearch in real time, in a streaming format. So if you have, for example, a collection of web server logs coming in that you just want to feed into your search index over time automatically, Filebeat can just sit on your web servers, look for new log files, parse them out, structure them in the way that Elasticsearch wants, and then feed them into your Elasticsearch cluster as they come in. Logstash does much the same thing. It can also be used to push data around between your servers and Elasticsearch, but often it's used as sort of an intermediate step: you have a very lightweight Filebeat client that sits on your web servers, and Logstash accepts those logs and sort of collects them and pools them up for feeding into Elasticsearch over time.

But it's not just made for log files, and it's not just made for Elasticsearch and web servers either. These are all very general-purpose systems that allow you to tie different systems together and publish data to wherever it needs to go, which might be Elasticsearch or might be something else, but it's all part of the Elastic Stack still. It can also collect data from things like Amazon S3 or Kafka or pretty much anything else you can imagine, such as databases, and we'll look at all of those examples later in this course.

Finally, another piece of the Elastic Stack is called X-Pack.
This is actually a paid add-on offered by elastic.co, and it offers things like security, alerting, monitoring, and reporting. It also contains some of the more advanced features that are just starting to make it into Elasticsearch now, such as machine learning and graph exploration. So you can see that with X-Pack, Elasticsearch starts to become a real competitor for much more complex and heavyweight systems like Flink and Spark. But that's another piece of the Elastic Stack when we talk about this larger ecosystem. You can see here that there are free parts of X-Pack, like the monitoring framework, that will let you quickly visualize what's going on with your cluster: what's my CPU utilization, system load, how much memory do I have available, things like that. So when things start to go wrong with your cluster, this is a very useful tool to have for understanding its health.

So that's the Elastic Stack at a high level. Obviously Elasticsearch can still be used for powering search on a website like Wikipedia or something, but with these components it can be used for so much more. It's actually a larger framework for publishing data from any source you can imagine and visualizing it as well through things like Kibana, and it also has operational capabilities through X-Pack. So that is the Elastic Stack at a high level. Let's dive more into Elasticsearch itself and learn more about how it works.

Before we start playing with our shiny new Elasticsearch server, let's go over some basics of Elasticsearch first, so we understand the concepts of how it works, what it's all about, and how it's architected. When we're done with that, we'll have a quick little quiz to reinforce what you learned, and after that we'll start messing around with it.

So there are three main logical concepts behind Elasticsearch. The first is the document. If you're used to thinking of things in terms of databases, a document is a lot like a row in a
database. It represents a given entity, something that you're searching for. And remember, in Elasticsearch it's not just about text; any structured data can work. Now, Elasticsearch works on top of JSON-formatted data. If you're not familiar with JSON, it's basically just a way of encoding structured data that may contain strings or numbers or dates or what have you, in a way that you can transmit it across the web cleanly. You'll see a ton of examples of this throughout the course, so it'll make more sense later on. Now, every document can have a unique ID, and you can either explicitly assign a unique ID to it yourself or allow Elasticsearch to assign it for you. It also has a given data type that describes what sort of thing this document is. So for example, you might have documents that represent encyclopedia articles, or documents that represent log entries from your web server.

And that's where the concept of types comes in. You can have many documents that belong to a given type, and a type is basically a schema or a mapping shared by a bunch of documents. So for example, you might have a type that defines what an Apache access log entry looks like, and I might define a mapping that says an Apache access log contains things like a request URL, a status code, a request time, a referring URL, and things like that. Or you might have a type that represents an encyclopedia article, with things like the text of the article itself, the author of the article, the date the article was written, the title of the article, whatever else there might be. So you can think of a type as a schema, and again, if you take this back to a database analogy, it's a lot like a table where you define the individual columns that are in a given row, or a document in our terminology.

Finally, there's the concept of an index, which is a collection of types, and it is basically an entity that you can search across. So if you need to search across multiple different types
potentially, then you'd want to make sure that all those types are contained within the same search index. So an index is sort of the highest-level entity that you can query against in Elasticsearch, and it can contain a collection of types, which in turn contain a collection of documents. So again, bringing this back to the analogy of a database, you can think of an index as a database, a type as a table, and a document as a row. Those are kind of the analogies from the database world to the Elasticsearch world.

Now of course it's not quite that simple. Basically, what is an index? Well, an index is actually what's called an inverted index, and this is the mechanism by which pretty much all search engines work. So the idea is that if I have a couple of documents, and let's assume they just contain text data here, let's say I have one document that contains "space, the final frontier, these are the voyages", and maybe I have another document that says "he's bad, he's number one, he's a space cowboy with a laser gun". And if you understand what both of those are references to, then you and I have a lot in common.

Now, an inverted index wouldn't store those strings directly. Instead, it flips it on its head. What a search engine does is split each document up into its individual search terms, and in this example we'll just split it up by word and lowercase the words to normalize things. Then it maps each search term to the documents that the term occurs within. So in this example, the word "space" occurs in both documents, so my inverted index would indicate that the word "space" occurs in both documents 1 and 2.
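The splitting, lowercasing, and mapping just described can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene actually stores its index; the function name `build_inverted_index` and the simple regex tokenizer are assumptions made for the example, and the two sample strings are copied from the captions, which may not match the original slide word for word.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Lowercase to normalize, then split into word-like terms.
        for term in re.findall(r"[a-z0-9']+", text.lower()):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Sample documents as transcribed in the captions.
docs = {
    1: "Space: the final frontier. These are the voyages",
    2: "He's bad, he's number one, he's a space cowboy with a laser gun",
}

index = build_inverted_index(docs)
print(index["space"])  # [1, 2]
print(index["final"])  # [1]
```

Looking up a term is then a single dictionary access, which is why a search for one word does not need to scan every document.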
The word "the" also appears in both documents, so that will also map to both documents 1 and 2. And the word "final" only appears in the first document, so our inverted index would map the search term "final" to document 1. Now, it's a little bit more complicated than that in practice. In reality, it stores not only what document a term is in but also the position within the document, so we can do things like phrase search. But at a high conceptual level, this is the basic idea: an inverted index is what you're actually getting with a search index, where it's mapping things that you're searching for to the documents that those things live within.

And of course it's even not quite that simple. So how do I actually deal with the concept of relevance? Let's take for example the word "the". How do I deal with that? The word "the" is going to be a very common word in every single document, so how do I make sure that only documents where "the" is like a special word are the ones that I get back if I actually search for the term "the"? Well, that's where TF-IDF comes in. That stands for term frequency times inverse document frequency. It's a very fancy-sounding term, but it's actually a very simple concept, so let's break it down.

Term frequency is just how often a given search term appears within a given document. So if the word "space" occurs very frequently in a given document, it would have a high term frequency, and if the word "the" appears frequently in a document, it would also have a very high term frequency. Now, document frequency is just how often a term appears in all of the documents in your entire index. So here's where things get interesting. The word "space" probably doesn't occur very often across the entire index, so it would have a low document frequency. However, the word "the" does appear in all documents very frequently, so it would have a very high document frequency. So when we divide term frequency by document frequency, and that's
the same as multiplying by the inverse document frequency, mathematically, we get a measure of relevance. So we see how special this term is to this document. It measures not only how often this term occurs within this document, but how that compares to how often this term occurs in documents across the entire index. So with that example, the word "space" in an article about space would rank very highly. However, the word "the" wouldn't necessarily rank very highly if that's a common term found in every other document as well. And this is the basic idea of how search engines work: if you're searching for a given term, it will try to give you back results in the order of their relevancy, where relevancy is loosely based, at least, on the concept of TF-IDF. Got it? It's really not that complicated.

So how do you actually use an index in Elasticsearch? Well, there are three ways we can talk about. One is the RESTful API. Now, if you're not familiar with the concept of REST queries, let me explain it at a very high level. It's just like how you request a web page from a web server from the web browser on your desktop. When you're requesting a web page in your browser, like Chrome or whatever you use, what's happening is that your browser is sending a REST request to a web server somewhere. Every REST request has a verb like GET or PUT or POST, and some sort of body that specifies what it is that you want to get back. So for example, if you're looking for a web page, you would send a REST query with a GET verb, and that GET would request a specific URL that you want to retrieve from that web server.

Now, Elasticsearch works exactly the same way, over the same HTTP protocol that web servers work across, so this makes it very easy to talk to from different systems. So for example, if you were searching for something on Elasticsearch, you would issue a GET request through a REST API over HTTP, and the body of that GET request would contain the information about what it is you want to
retrieve, in JSON format. We'll see examples of this later on. But the beautiful thing about this is that if you have a language or an API or a tool or an environment that can handle HTTP requests, like just talking to the web normally, then it can handle Elasticsearch. You don't need anything beyond that. If you understand how to structure the JSON requests for Elasticsearch, then any language that can talk HTTP can talk to Elasticsearch. Most of this course is going to focus on doing it that way, just so you understand how things work at a lower level and what Elasticsearch is capable of under the hood.

But you don't always have to do it the hard way. If you are accessing Elasticsearch from some application that you're writing, like a web server or a web application or whatever it is, often there will be client APIs that provide a level of abstraction on top of those REST queries. So instead of trying to figure out how to construct the right JSON for the type of search that you want, or for inserting the kind of data that you want, there are a lot of client APIs out there that make it easier for you. They have specialized APIs for searching for things and putting things into the index without getting into the nitty-gritty of constructing the actual request itself. So whether you're using Python or Ruby or Perl or C++ or Java, there are APIs out there that you can just use.

Finally, there are even higher-level tools that can be used for analytics, and one that we'll look at in this course is called Kibana. It's part of the larger Elastic Stack, and it is a web-based graphical UI that allows you to interact with your indices and explore them without writing any code at all. So it's really more of a visual analysis tool that you can unleash upon pretty much anyone in your organization. So in order from lower-level to higher-level APIs: there are RESTful queries that you can issue from whatever language you want, you can use client APIs to make things a little bit easier,
or you can just use web-based UIs to get the information you need as well. So those are the basic concepts of how Elasticsearch is structured and how you interface with it. With that under our belt, we can talk more about how it works under the hood and how its architecture works.

Let's talk about Elasticsearch's architecture and how it actually scales itself out to run on an entire cluster of computers that you can scale up as needed. The main trick is that an index in Elasticsearch is split into what we call shards, and every shard is basically a self-contained instance of Lucene in and of itself. So the idea is that if you have a cluster of computers, you can spread these shards out across multiple different machines, and as you need more capacity, you can just throw more machines into your cluster and add more shards to that entire index so that it can spread that load out more efficiently.

So the way it works is, once you talk to a given server on your cluster for Elasticsearch, and once it figures out what document you're actually interested in, it can hash that to a particular shard ID. So we'll have some mathematical function that can very quickly figure out which shard owns a given document, and then it can redirect you to the appropriate shard on your cluster very quickly. So that's the basic idea: we just distribute our index among many different shards, and different shards can live on different computers within your cluster.

Let's talk about the concept of primary and replica shards. This is how Elasticsearch maintains resiliency to failure. One of the big problems that you have when you have a cluster of computers is that those computers can fail sometimes, and you need to deal with that. So let's look at this example. We have an index that has two primary shards and two replicas. In this example we're going to have three nodes, and a node is basically an installation of Elasticsearch. Usually you'll see one node installed per physical server in your cluster. You can
actually do more than one if you want to, but that would be a little bit unusual. The design is such that if any given node in your cluster goes down, you won't even see it as an end user; you can handle that failure.

So let's take a close look at what's going on here. In this example I have two primary shards. Those are basically the primary copies of my index data, and that's where write requests are going to be routed to initially. That data will then be replicated to the replica shards, which can also handle read requests whenever we want. So let's take a look at how this is set up. Elasticsearch figures this all out for you automatically; that's kind of what Elasticsearch gives you. So if I say I want an index with two primaries and two replicas, it's going to set things up like this if you give it three different nodes.

So let's look at an example here. Let's say that node 1 were to fail for some reason: it had some disk failure, the power supply burned out, who knows, could be anything. In this case we're going to lose primary shard 1 and replica shard 0. But it's not a big deal, because we have a replica of shard 1 sitting on node 2 and another replica sitting on node 3. So what would happen if node 1 just suddenly went away is that Elasticsearch would figure that out and elect one of the replica shards on node 2 or 3 to be the new primary. Since we have those replicas sitting there, it's fine. We can keep on accepting new data and we can keep on servicing read requests, because we're now down to one primary and one replica, and that should be able to get us by until we can restore the capacity that we lost with node number 1.

Similarly, let's say node number 3 goes away. In that example we lost our primary shard 0, but it's okay, because we had a replica sitting on node 1 and node 2, and Elasticsearch can just promote one of those replicas to be the new primary, and it can get
by until we can restore the capacity that we lost. So you can see that using a scheme like this, we can have a very fault-tolerant system. In fact, we could lose multiple nodes. Node 2 is just serving replica shards at this point, so we could even tolerate node 1 and node 2 going away at the same time, in which case we'd be left with a primary on node 3 for both of the shards that we care about. So it's pretty clever how that works. There are some things to note here. First of all, it's a good idea to have an odd number of nodes for this sort of resiliency that we're talking about. But it's pretty cool, right? And the idea is that your application would just round-robin its requests among all the different nodes in your cluster, which spreads out the load of that initial traffic.

Let's talk a little bit more about what exactly happens when you write new data to or read data from your cluster. Let's say you're indexing a new document into Elasticsearch. That's going to be a write request. When you do that, whatever node you talk to will say: okay, here's where the primary shard lives for this document you're trying to index, and I'm going to redirect you to where that primary shard lives. So you'll write that data and index it into the primary shard on whatever node that lives on, and then that will automatically get replicated to any replicas for that shard.

Now, when you read, that's a little bit quicker. It can just route the request to the primary shard or to any replica of that shard, so that can spread out the load of reads even more efficiently. The more replicas you have, the more read capacity you have for the entire cluster. It's only the write capacity that's going to be bottlenecked by the number of primary shards you have.

Now, the kind of sucky thing is that you cannot change the number of primary shards in your cluster later on. You need to define that up front, right when you're creating your index. And here, by the way, is
what the syntax for that would look like. Through a REST request, we would specify a PUT verb with the index name, followed by a settings structure in JSON that defines the number of primary shards and the number of replicas. Now, this isn't as bad as it sounds, because a lot of applications of Elasticsearch are very read-heavy. If you're powering a search index on a big website like Wikipedia or something like that, you're going to get a lot more read requests from the world than you're going to have indexing requests for new documents. So it's not quite as bad as it sounds in a lot of applications. Oftentimes you can just add more replicas to your cluster later on to add more read capacity. It's adding more write capacity that gets a little bit hairy. It's not the end of the world if you do need to add more write capacity; you can always re-index your data into a new index and copy it over if you need to. But you want to plan ahead and make sure you have enough primary shards up front to handle any growth that you might reasonably expect in the near future. We'll talk about how to plan for that more toward the end of the course.

By the way, just as a refresher, let's also talk about what actually goes on with this particular PUT request for defining the number of shards. In this example we're saying we want three primary shards and one replica. How many shards do we actually end up with here? Well, the answer is six. We're saying we want three primary shards and one replica of each of those primary shards, so you see how that adds up: three times one is three, plus the three original primaries gives us six. If we had two replicas, we would end up with nine total shards: three primaries, and then a total of six replicas to give us two replica shards for each primary shard. So that's how that math works out. It can be a little bit confusing sometimes, but that's the idea. But anyway, that's the general idea of how
Elasticsearch scales and how its architecture works. The important concepts here are primary and replica shards, and how Elasticsearch will automatically distribute those shards across different nodes that live on different servers in your cluster to provide resiliency against failure of any given node. It's pretty cool stuff.

All right, like they like to say in my kids' schools, it's time to show what you know. It's quiz time! Don't worry, it's not too hard; I just want to make sure you were awake during these past few lectures.

First question: the schema for your documents, that is, the definition of what sort of information is stored within your document, is defined by what? The index, the type, or the document itself? Where is the information stored as to the actual schema of the information that a document contains? The answer is the type. So the type is basically, again, the equivalent of a table in a database. It defines the individual fields that a document contains and what data types they are. So again, going back to the example of an Apache log entry, that type might define things like the URL that was requested, the status code, the request time, the referring URL, and things like that. Or if we're storing something like Wikipedia entries, it might just include things like the text of the article itself, the author of the article, the title of the article, things like that. That is all defined by the type of a document, and again, we define types in Elasticsearch by defining what's called a mapping when we're setting up our indexes.

Question two: what purpose do inverted indices serve in a search engine? This isn't specific to Elasticsearch; it's for search engines in general. Does an inverted index allow you to search phrases in reverse order? Do they quickly map search terms to the documents that they reside within? Or are they a load-balancing mechanism for balancing search requests across your entire cluster? What do you think the answer is? If you said the second one, you're right:
they quickly map search terms to documents. So remember, an inverted index simply maps specific search query terms to the documents that they live in. As you index documents, an inverted index is created, where it splits those documents into search terms and gives you a very quick lookup of where to find those search terms in given documents.

Next question: if I have an index configured for five primary shards and three replicas, how many shards would I have in total? It's a little bit tricky here, so think about that for a little bit. Is the answer 8, 15, or 20? I can think of ways of doing the math where I could get any of those answers. But the correct answer is... hit pause if you still have to think about it, no cheating... the answer is 20 shards. The way that works is that I have five primary shards, so I start with five shards, and then I want three replicas of each shard as well. Three times five is fifteen, so I end up with 5 primaries and 15 replica shards, for a total of 20 shards in this particular example. Remember how that works, because they can add up fast. And remember, a given node can actually contain many different shards, so it will distribute shards among nodes in whatever way makes sense on your cluster automatically. Just because I have 20 shards does not mean I need to have 20 machines in my cluster; we can have many shards on a given node.

Next question: Elasticsearch is built only for full-text search of documents, true or false? Well, you've got a 50-50 shot on this one, but if you've been paying attention at all, you know the answer is false. Elasticsearch can index any sort of structured data with any kind of mapping you can dream up. So it's not just for full-text search anymore; it's not just for searching encyclopedias and websites and blogs. It can also be used for searching and even aggregating and visualizing numerical data or time-based data or whatever you can dream up, and increasingly it's being used as a tool, for example, for aggregating web
logs from web servers and sort of building a system that can compete with Google Analytics and things like that. So Elasticsearch is not just for search anymore.

Hey, I hope you enjoyed this video and maybe learned a thing or three. If you'd like to learn more with me, please check out the link in the description below. I have a whole larger course on data science, deep learning, and machine learning with Python there. It's very popular and highly rated, so I hope you'll check it out, and you'll also get a really nice discount on it by using that link in the description. Anyway, thanks for watching, and I hope to see you again soon.
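As a recap of the index-creation request and the shard arithmetic from the architecture and quiz sections, here is a minimal sketch in Python. It is not the request shown on the video's slide: the index name `testindex` and the helper names are hypothetical, while `number_of_shards` and `number_of_replicas` are the standard Elasticsearch index settings. The sketch only builds and prints the request; it does not contact a cluster.

```python
import json


def index_settings(primaries, replicas_per_primary):
    """Build the JSON body for a PUT request that creates an index."""
    return {
        "settings": {
            "number_of_shards": primaries,
            "number_of_replicas": replicas_per_primary,
        }
    }


def total_shards(primaries, replicas_per_primary):
    """Each primary shard gets the configured number of replica copies."""
    return primaries * (1 + replicas_per_primary)


body = index_settings(3, 1)
print("PUT /testindex")  # hypothetical index name
print(json.dumps(body, indent=2))

print(total_shards(3, 1))  # 6
print(total_shards(3, 2))  # 9
print(total_shards(5, 3))  # 20
```

The three totals match the examples worked through in the captions: 3 primaries with 1 replica give 6 shards, 3 primaries with 2 replicas give 9, and 5 primaries with 3 replicas give 20.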
Info
Channel: Udemy
Views: 258,126
Keywords: Frank Kane, elasticsearch, elastic stack, elasticsearch tutorial, elasticsearch tutorial for beginners, elasticsearch in depth, Sundog Education, free elasticsearch tutorial, elasticsearch architecture, Elastic Stack tutorial, elasticsearch basics, elasticsearch query tutorial, elasticsearch course, elasticsearch training, Logstash tutorial, what is elasticsearch, Kibana tutorial, beats tutorial, sundog education course, frank kane udemy, udemy frank kane, Udemy, elk edureka
Id: C3tlMqaNSaI
Length: 26min 1sec (1561 seconds)
Published: Thu Dec 27 2018