Use Logstash to load CSV into Elasticsearch

Video Statistics and Information

Captions
Hi there, welcome to this lecture. Now we're going to get even more practical: I'll show you how to work with real data and ingest it into Elasticsearch, and for that we're going to be using a tool called Logstash. It's part of what they refer to as a typical stack, the combination of Elasticsearch, Logstash, and Kibana. Elasticsearch is the core, the engine where everything is stored and searched, and Kibana is the user interface we've been using to interact with Elasticsearch. Logstash is the middle guy: it sits between your application or data source and Elasticsearch, and it facilitates the ingestion of data into your search engine. So you can have applications generating logs, and those logs can be ingested into Elasticsearch in real time using the data pipeline that Logstash provides. In this course so far we've been using the PUT command to index documents into Elasticsearch, which was a very manual process; we were just doing that to test out the features of Elasticsearch and figure out how it works. Using Logstash, we can get practical and ingest real data into our cluster.

Let me take you over to the Elastic website. This is the stack I'm referring to, the ELK stack. Kibana, as you know, is the front-end tool we've been using to interact with Elasticsearch, and Elasticsearch itself is the core: search, analyze, and store your data. Down here there are a couple of other tools, which I consider auxiliary tools to make your life easier. It's enough to know Elasticsearch and Kibana; those two are the most important, especially Elasticsearch, but the auxiliary tools can help you work with it. Logstash is a dynamic data collection pipeline, and the perfect use case for it is application logs: you've got applications running, and with every minute that ticks by they generate logs, and that data can be ingested into Elasticsearch using Logstash.

A lot of the examples on the Elastic website have to do with things like Apache logs and timestamps, so I won't be using that as an example. We're going to use an actual real-world data set that you can analyze, run your aggregate queries on, perform searches with, and so on. Navigate over to a website called Kaggle.com. This website provides real-world data that you can practice your data science skills with, and if you click on Datasets you'll find data that people have uploaded: house sales in King County, USA, Texas death row executions info, and so on. New data gets uploaded to this website every day. Some of these data sets are very large, but I'm going to pick something in the middle: not so much data that your computer crashes, but enough to be a practical example for this course. I'll pick a data set we can all relate to, for example cars. In the search box just type in "cars" and you'll see this data set right here, classified ads for cars. Let's select it. The page gives some information about what kind of data this is: car sales recorded in the Czech Republic and Germany over a period of more than a year. The content of the file we're going to download is right here; these are the different fields that will be part of the file: the transmission, the door count, the fuel type, the price of the vehicle in euros, the maker, the model, the mileage in kilometers, and so on. You can download the file by clicking the download link right here. It's a 92-megabyte file, so it's pretty large, and it's in CSV format. If you scroll down you can see the name of the file, and notice it ends in .csv, meaning comma-separated values: each column is separated by a comma, and these are the different rows.

If you want to preview what this file looks like, you can scroll down and look at the data to get an idea, but we're going to open it up in Excel, hopefully, if it's not too large. Let's click Download and it will start downloading the file; it's 92 megabytes, so depending on your internet speed it might take some time. I've also provided this file as part of the course, in case they change the links on this website and the file is no longer available, which would throw things off for you. Anyway, once we've downloaded the file we can check where it's been downloaded to. I'm going to move it to my home directory, which is /Users/imtiazahmad. Let's head over there. This is the file; I'm going to rename it to something easy, let's just call it cars.csv. I'm also going to create a new folder here called data and move the file into it. Now let's open up this file. It might take your computer some time, since this is a pretty large file; if it's 92 megabytes of comma-separated values it might be well over a million records, but we'll see. It may take a minute or two to open, so I'm going to pause the video and skip ahead to the point where the file is open in Excel. That's the default application on my computer; if you don't have Excel, you can open this in Notepad or whatever text editor you use on your operating system. Excel recognizes CSVs and is able to put them into a table format. Now, Excel says the file was not completely loaded, but that doesn't mean it won't open; if I hit OK, it opens what it can. The reason is that this is a very large file.

If we scroll all the way to the bottom, there are well over a million records in this file, and Excel has a limit on how many rows it can open. That doesn't mean we won't have access to the entire file: we'll have access to all the records in Elasticsearch. Remember, Elasticsearch is built for big data, but Excel is not, so here we can only see the first million or so records. Anyway, let's take a look at what kind of data this file has. We have different manufacturers: Ford, BMW, Toyota, Suzuki, and so on. These are the brands of the vehicles, then we have the model for each vehicle, then the mileage, then the year in which the car was manufactured. Then there are some other fields: engine displacement (I don't know what that is, I'm not a big car person), engine power (I believe this is the horsepower), body type, color, and slug (not sure what those are). Here we have the transmission, either manual ("man") or automatic ("auto"), then the door count, the seat count, and so on. The fuel type is diesel or gasoline, the date the vehicle was listed is in the date-created column, and finally we have the price for each vehicle. This looks like a healthy data set that we can do a lot of analysis on, so I'm happy with it. Let's close the file; we don't need to save any changes. It's a comma-separated file, and I saved it in the data folder on my machine. Wherever you save it, just make sure you remember where it is, because in the Logstash configuration we're going to have to specify where Logstash should read the file from so it can index it into Elasticsearch. Now let's head back to the Elastic website, navigate down to Logstash, and download that application.

The installation steps are given right there: download and unzip the Logstash file, prepare the config file (I'll show you how to do that), and then run the Logstash application using that configuration file. There's also a video you can watch and getting-started guides; I'd recommend reading through those if you're interested in delving deeper into Logstash, but don't do that right now, let's stay with the course. You can come back to them later; there's some great documentation on this website. Let's go up here and download the file. I'm going to get the tar file, and it will take a few minutes depending on your internet speed. There you go, I think it's done. This is the file; let me copy it and paste it into my home directory, and I'll show you how to unzip, or rather untar, it. There's a special command in Linux for that; if you're not familiar with it, don't worry, I'll go over it right now. Open a new terminal tab and navigate to the home directory: type cd and that takes you there automatically. Here's the file, and we also have our data folder with the CSV file in it. To extract, or untar, it, the command is tar -xvf followed by the file name. Hit Enter and it unpacks the archive; notice that in our file system it generated a folder containing the extracted contents. So that's extracted; now we need to create the Logstash configuration file. Really quickly, before we configure Logstash to work with our Elasticsearch cluster, let's go up to the Learn tab; I just want to visit the Logstash documentation. Click Get Started and look for the Logstash Reference (5.4), then visit the Configuration section; this is what I want to cover real quick.

When you kick off Logstash, you do it with the intention of loading some data, and the configuration file needs to describe the structure of that data. For our cars file, that means the different columns and, if you want to configure them specifically, their data types. You also have to specify which Elasticsearch instance the file is going to be loaded into. So there are basically two major steps: the input, where we state that we're reading from a file (I'll show you how), and the output, which is going to be our Elasticsearch cluster, where we specify the host, the port, and so on. When you run the Logstash application, you do it like this: you run logstash from the bin directory with -f followed by the configuration file you want to use, so that Logstash knows where to get your file and how to index it into Elasticsearch. Look at the structure of a config file: these are the major parts, the input, the filter (where we specify the different fields, the columns of the file), and the output (where we state the Elasticsearch instance details). Logstash has its own configuration language; it looks like JSON, but it isn't exactly JSON, because notice the field names don't have quotation marks around them. I'm going to show you how to configure Logstash to work with our file and index it into Elasticsearch, so let's do that right now. I'm going to open up a text editor (I'm using Sublime Text), do File > Save As, and save a new file in the data folder we created, the same place as the actual CSV file. I was going to call it logstash.config, and you can name it whatever you want, but let me make it more specific to cars, since that's the configuration we're going to be writing.

So the name is logstash_cars.config; hit Save. In this file we're going to specify the input (where the data comes from) and the output (which cluster the file is going to be indexed into). The first thing we need is input, and inside it we specify what kind of input: it's going to be file, and inside that we specify path. The syntax comes from Ruby; it's like a map, with a key and a value. We state the value as a string, and that's the location of the file: I saved it under /Users/imtiazahmad (my home directory), in the data folder, and the file name is cars.csv. The next thing we need is start_position, which is the place in the file where you want to start indexing from; I'm going to put "beginning". By default Logstash picks up from the end of the file, because applications typically keep appending to the same log file, but since this is an already fully generated file that isn't going to be updated while we're indexing, we just start from the beginning. Then there's this other thing called sincedb_path; don't worry too much about what it is, but setting it to /dev/null basically lets you reuse this configuration file to index the data again. It's not important to go over exactly what it does. The next section is the filter, where we specify the details of the different columns and their data types. The kind of filter relevant to us is csv, and inside it we specify the separator that separates the different fields; since this is a comma-separated file, that's just a comma. After that we specify the different columns, and the columns go into an array. I've already pasted the columns elsewhere, so I'm just going to copy them and paste them into this array.

We've got maker, model, mileage, manufacture year, and so on, all of the fields you saw in the CSV file we downloaded. So that's the csv section. After that, we need something to specify the data types, because by default Logstash treats everything as a string, character data. For example, maker we know is textual data (it's the manufacturer of the vehicle), but something like mileage is an integer, and we'd like to state that it's numeric. The manufacture year we can leave as a string, but fields like the price, the seat count, and the door count are numeric, and we want to say so specifically so that we can run aggregations and statistics on them. There's a keyword called mutate, and inside it we specify convert, followed by pairs of field name and target type. So let's pick mileage: the type for mileage is going to be integer. I'll do the same for the other numeric fields, so I'll copy and paste this a couple of times and fill in the different columns. The price in euros is going to be a float, because we know prices have decimals. Then there's engine power, which could be the horsepower; that should also be an integer. Door count is another integer, and seat count, the last column here, is also an integer. The rest of the columns, the ones we didn't mention under mutate, will just be treated as string data, textual data. So inside csv we've got the separator and the columns, and outside csv, still within the filter section, we have the mutate. Now we've got the input and the filter, which says how we want to convert the data.

Finally we need the output, and the output is very easy: it's elasticsearch. Logstash is capable of much more (you can send these documents to MongoDB and other kinds of NoSQL data stores), but we're using Elasticsearch, of course. In here we specify the connection settings; let me make some room so we can see. The first property is hosts, where we specify the URL for our cluster; I'm just going to put localhost. We could specify the port as well, but we don't have to if we're using the default, 9200. Next we need the index: which index is this file going to be indexed into? I'm going to call it cars, to keep it simple. Finally, what is the document type? Each document is going to be a particular car that was sold, so we can set document_type to sold_cars. The mapping will be created for this sold_cars type, it will go into the cars index, and each of those rows you saw in the CSV file becomes a document of type sold_cars that makes its way into the cars index. After this we can add stdout, which means writing to standard output, the console, so that Logstash prints to the terminal while it's loading our file. Let's save this file and make sure it's in the data directory; there it is, logstash_cars.config. It isn't necessary to keep this file in the data folder, and you can have it anywhere on your computer as long as you can refer to it when we execute Logstash, but I just want to keep things organized, and I think it's a good place to keep the configuration file and the data file together.

In the next lesson I'm going to show you how to execute this configuration with Logstash. We've got everything set up, and it's just a matter of indexing that large file into Elasticsearch. Then we're going to look at some really cool things we can do with Kibana, such as creating a dashboard with different charts, and you can watch in real time how the charts change while the data is loading. So some cool stuff is coming in the next lecture. Exciting, right? Now is a good time to end this lecture. I will see you in the next one; thanks for watching.
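Putting the pieces of the walkthrough together, the finished logstash_cars.config would look roughly like this. This is a sketch reconstructed from the lecture, not a verbatim copy of the instructor's file: the path reflects his machine, and the column list is the one from the Kaggle classified-ads data set, so adjust both for your own download.

```conf
input {
  file {
    path => "/Users/imtiazahmad/data/cars.csv"  # wherever you saved the CSV
    start_position => "beginning"               # read from the top, not just new lines
    sincedb_path => "/dev/null"                 # forget progress so reruns re-index everything
  }
}

filter {
  csv {
    separator => ","
    columns => ["maker","model","mileage","manufacture_year","engine_displacement",
                "engine_power","body_type","color_slug","stk_year","transmission",
                "door_count","seat_count","fuel_type","date_created",
                "date_last_seen","price_eur"]
  }
  # Everything not converted here stays a string.
  mutate {
    convert => {
      "mileage"      => "integer"
      "engine_power" => "integer"
      "door_count"   => "integer"
      "seat_count"   => "integer"
      "price_eur"    => "float"
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]   # port is optional when using the default 9200
    index => "cars"
    document_type => "sold_cars"
  }
  stdout {}                       # echo progress to the terminal
}
```

You would then launch it from the extracted Logstash directory with something like `bin/logstash -f /Users/imtiazahmad/data/logstash_cars.config`, and while it runs, `curl localhost:9200/cars/_count` is a quick way to watch the document count in the cars index grow.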
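The extraction step described above looks like the following. The archive name is a placeholder for whichever Logstash version you downloaded from elastic.co; to keep the sketch self-contained and runnable, it first builds a stand-in archive and then extracts it exactly the way the lecture does.

```shell
# Stand-in archive so the example can run anywhere; in the lecture this
# file comes from the elastic.co download page (version name is assumed).
mkdir -p stage/logstash-5.4.0/bin
printf 'placeholder\n' > stage/logstash-5.4.0/bin/logstash
tar -czf logstash-5.4.0.tar.gz -C stage logstash-5.4.0

# The actual step from the lecture: untar the download in your home directory.
# x = extract, v = verbose (list files as they come out), f = archive file name.
tar -xvf logstash-5.4.0.tar.gz

# The extracted folder now sits alongside the archive.
ls logstash-5.4.0/bin
```

Modern tar detects gzip compression automatically, so `-xvf` works on the `.tar.gz` without an explicit `-z`.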
Info
Channel: Imtiaz Ahmad
Views: 151,011
Keywords: logstash, elasticsearch, kibana, Machinelearning, elk, elk stack
Id: rKy4sFbIZ3U
Length: 19min 54sec (1194 seconds)
Published: Tue Jun 13 2017