Kafka + Spark Streaming + Hive -- example

Captions
Welcome to this video on Kafka and Spark Streaming. This is going to be a hands-on, proof-of-concept example: I'll walk through installing Kafka, Hadoop, Hive, and Spark, writing some code, and getting a working example of Spark Streaming from a Kafka source. I'm not going to cover the theoretical side of Hadoop, Kafka, or Spark, so you should already have a basic understanding of how Kafka works, as well as the Hadoop environment. For Kafka specifically you just need the basics of producers, a cluster, and consumers; we'll be working with a producer, the cluster, and a stream processor, which will be Spark in this example.

Here's the plan: we'll collect a stream of tweets from Twitter, publish the tweets to a Kafka cluster, pull the tweets from that cluster, do a transformation on them (a character count and a word count for each tweet), and save that data to a Hive table. The steps: set up a virtual machine, install Kafka, Hadoop, Hive, and Spark, write a Kafka producer and a Spark transformer that will produce a stream and read the stream, and then watch it go. Throughout this tutorial I'll be referencing a GitHub readme (not complete yet, but it will be soon), mainly for the large configuration files, which I'll copy and paste to make things a little easier on video.

First step: set up a virtual machine. This is possible to do on your actual computer, but I think it's easier to use a virtual machine, and if you're following along you can make sure you're doing each step correctly. If you haven't already, get VirtualBox; it's a free way to host virtual machines. After that, choose a Linux distribution of your choice and download a disk image. I chose Lubuntu. Whatever you pick, make sure you get an ISO image; that's the easiest way to do it. I already have mine downloaded, so I'll start up VirtualBox.

To make a new virtual machine, click New, name it Lubuntu, and give it four gigabytes of RAM (you could get by with a little less, but I think four gigs is pretty good). Create a virtual disk now: VDI, dynamically allocated, 15 GB. At this point we've only defined the hardware; nothing will happen if we start it up, so click Settings, then Storage, click the little disk icon (this is essentially the disk drive of the virtual machine), choose a virtual optical disk file, go to your downloads, and select the Lubuntu desktop ISO. Now the disk image is attached to our virtual machine.

Before we start it up, let's check a couple more settings. We do want internet access; by default the network is set to the NAT adapter, which provides it, so that's fine. Also, under General, Advanced, make sure Shared Clipboard is set to Bidirectional, which lets us copy and paste between the host computer and the virtual machine; that will be very helpful.
Now that we have those settings in place, we can start up the virtual machine. Choose your language and start Lubuntu. With this release of Lubuntu you get a live version first, with the option to install; it's up to you, but I'm going to install Lubuntu 19 because I plan on using this virtual machine for other projects. Click through the installer, and when it asks to erase the disk, that doesn't mean erasing the disk on your computer; it just erases the 15 GB virtual disk we made so Lubuntu can install cleanly. Name the machine, choose a password you can remember, and let the installation run (I'm going to speed this up). Once the installation is complete, restart and log in.

You may have noticed that even if you resize the VirtualBox window, the actual screen of the virtual machine stays small, and that gets really annoying if you're going to be working in this little window. So one of the first things to do is go up to the Devices menu of the VM window and choose Insert Guest Additions CD image. When the disc pops up, just cancel the autorun prompt, because it's easier to do this from the terminal. Open a terminal and go into /media, then the next directory should be your username, and inside that the only thing there should be the VirtualBox Guest Additions CD. In that directory you'll see a file called VBoxLinuxAdditions.run; run it with sudo (sudo ./VBoxLinuxAdditions.run) and let it finish. It looks like it worked, so we can now resize the window and the display will follow; I'll make it just a little bigger.

Throughout the remainder of this tutorial we're going to be working out of the home directory, the little tilde. You can put things wherever you like, but for the sake of a proof of concept I think it's easier to keep everything in the same place. Before we continue I'm going to simplify things a little: since we're working in the home directory, I don't really want to be looking at a Documents folder and all the rest, so let's remove those and keep Desktop and Downloads. Lastly, I'm going to increase the terminal font a bit (17 with a little margin, nice and easy to read) and, optionally, customize the prompt in the .bashrc, which we'll be editing later anyway; it's good practice to run it with source so the change takes effect. I like the prompt a little smaller; it still tells us where we are, just easier to read.

One more thing: when we installed the Guest Additions there was a warning that some features might not be available until we restart the VM. One of those is the ability to copy and paste from the host machine to the VM, which is going to be very important, so before we continue we do need to restart the VM.
Okay, so we have a VM set up and we're ready to install Kafka, which is the next step on the list. Get back to your VM, open a terminal, and the first thing to do is go to the Apache Kafka download page. If you search for "Kafka download" you can go through the QuickStart link or straight to the download page; there's a tarball you can fetch with wget or just click on and save. I already have it downloaded, so it's in my Downloads directory.

First we need to unpack it, which drops it right into the home directory, and then rename the folder to kafka so it's easier to work with. If we jump in and look, there's a folder with the scripts we'll run (bin) and the config folder, which are the main things we'll work with. If you haven't already looked at the Kafka QuickStart, I think it's a good starting place for working with Kafka; we're going to use just the basics from it and then build on top with Spark Streaming.

As noted in the QuickStart, you do need Java to run Kafka, as well as Hadoop and a lot of the other Apache tools, so let's do that real quick: sudo apt install OpenJDK 8. I already installed Java (and downloaded most of the tarballs) earlier because my internet is pretty slow, so this might take you a minute. Once you have it, you can check with java -version, and that looks good.

Now we can run Kafka. Before you start the broker you have to start an instance of ZooKeeper: bin/zookeeper-server-start.sh with the ZooKeeper configuration file from config. One nice thing about this desktop environment, or at least this terminal, is that you can name your tabs, so I'll name this one "zookeeper". In another tab, go back into the kafka directory and start the Kafka broker with bin/kafka-server-start.sh and the default server.properties; I'll name this tab "broker". We're just doing a single-broker cluster for simplicity, and it looks like our ZooKeeper instance and broker are communicating.

Now that our broker is running, we need to create a topic. Topics are what producers publish to and consumers subscribe to. We can first check what topics, if any, exist on our cluster; as expected there are none, since we only just started it. To create one, we tell the tool which server to talk to, which is the broker we just started (9092 is the default port defined in the server.properties we passed in), set the replication factor to 1 because this is a single-node Kafka cluster, and use a single partition. Then we give the topic a name: since we're going to stream in tweets and pull them out with the Spark transformer, let's call the topic "tweets". It looks like that worked, and listing the topics again shows our tweets topic, ready for messages to be published to it.
We can actually test that out real quick. Again, this is all covered in the Kafka QuickStart: run the console producer, set the broker list to our server, and set the topic to tweets, and we'll be able to produce messages. So we can see it side by side, I'll open another terminal and run the console consumer on that side. Now the consumer is listening for anything new, and whatever I type in the producer comes through; so we know our ZooKeeper and Kafka cluster are working. We can close the consumer since we don't need it, keep the broker and ZooKeeper running, and exit the producer. That's mostly it for Kafka.

One last thing: for this example we're going to use Python to write the Kafka producer, and PySpark, the Python API for Spark, to write the transformer. Depending on your distribution you might already have Python 3 installed (Lubuntu does); if not, you'll need to get it. One thing that doesn't come installed is the Python package manager, so run sudo apt install python3-pip (I already downloaded it, like I said). Once you have that, run pip3 install kafka-python; this is the Python library that lets us work with Kafka. You can check that it's there with pip3 list piped through grep kafka, and double-check by entering the Python shell and importing kafka; if you don't get any errors, you're good to go. We're not going to start using it yet — we'll use it after all the environments are installed — but while we're on Kafka I thought I'd mention it.
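While we're here, if you want a quick way to confirm from Python that the whole pipeline (ZooKeeper, broker, topic) is reachable, a minimal round trip with kafka-python looks something like this. This little script is just a sanity check of my own, not one of the files from the video:

    # kafka_check.py -- produce one message to the "tweets" topic and read it back
    # (assumes the broker is running on localhost:9092, as set up above)
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    producer.send('tweets', b'hello from kafka-python')
    producer.flush()   # make sure the message actually leaves the client buffer

    consumer = KafkaConsumer('tweets',
                             bootstrap_servers='localhost:9092',
                             auto_offset_reset='earliest',   # read from the start of the topic
                             consumer_timeout_ms=5000)       # stop iterating after 5s of silence
    for message in consumer:
        print(message.value.decode('utf-8'))

If the message you sent (plus anything you typed into the console producer earlier) prints back out, the cluster and the Python client are both working.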
So let's check our progress: we just installed Kafka, so next up is Hadoop. Hop back into the VM, open a browser, and search for the Hadoop download; the top link is probably fine. We're going to use version 2.8.5, so click on the binary, which takes you to a list of mirrors; again, you can copy the link and use wget or just download it through the browser, which is what I did. It's in my Downloads folder, so let's unpack it — this takes a little longer than Kafka because it's a bit bigger — and then rename it to hadoop so it's easier to work with. Now Hadoop is right in our home directory.

One of the first things we need to do is edit the .bashrc. I'm going to use vim, but you can use nano or one of the GUI editors that come with your distribution. At the bottom of the file there are a few things to add, and I don't want to type them all out, so I'm going to copy them from the GitHub readme. One thing you do need to change: HADOOP_HOME should be your home directory plus hadoop, i.e. /home/<your username>/hadoop — I renamed the folder to hadoop, which is why it looks like that, and it keeps things simple. All the other configuration directories are some variation of HADOOP_HOME; different processes use them depending on their name. The other important thing here is the PATH entry, which lets us use all the commands in Hadoop's sbin and bin. The last one is JAVA_HOME: this should be the default location Java installed to, but depending on which version of Java you downloaded it might be a little different, so let's double-check. In a new window, look under /usr/lib/jvm; it's the Java 8 OpenJDK directory, so that's the one. Save the file and source it so the changes take effect; we can test with echo $HADOOP_HOME.

The next step is the Hadoop configuration itself. Inside your hadoop folder there's an etc directory, and inside that another hadoop folder with the configuration files. We're going to edit hadoop-env.sh first: there are two lines in there, JAVA_HOME and HADOOP_CONF_DIR, that I'll comment out and replace by pasting from the readme; again, replace the username with yours, make sure JAVA_HOME matches yours, and HADOOP_CONF_DIR should point to wherever your Hadoop is. Save that one. While we're still in this configuration directory we also need to edit hdfs-site.xml — I'm just going to delete everything and paste the version from the readme — and then do the same for core-site.xml.

I forget if I've mentioned it, but we're going to run Hadoop as a single-node cluster. Hadoop is designed to run across many machines, but for our example we're only using one, and we'll mimic a cluster by allowing this machine to SSH into itself. So let's install openssh-server and openssh-client; like I said, you might already have these depending on your distribution, but they install pretty quickly. We can check that SSH is running by doing ssh localhost — it doesn't look any different, because we're logging into the same machine — and then exit back to where we started. To make things easier we can set up passwordless authentication with two commands: generate an SSH key and then append it to authorized_keys. Try ssh localhost again and it goes straight in; that lets this machine SSH into itself, which is how we mimic a Hadoop cluster.

Now that we have that, Hadoop is set up and we should be able to start the cluster. We run hdfs, a command found in the Hadoop bin that we added to the PATH, so it's recognized from anywhere: hdfs namenode -format, and you'll see output like that. The next step is to start the distributed file system with start-dfs.sh, again from the sbin we added to the PATH; when it asks, say yes to allow the SSH connection.
We don't actually know yet whether Hadoop is running, so let's test it with hadoop fs -ls /, which lists the contents of the Hadoop file system. Nothing was returned, which is usually a good sign — if something was wrong we'd see an error — but let's make sure by creating a directory called test and listing the HDFS contents again. Yep, our test folder is there on the Hadoop file system, so Hadoop is installed and we're doing well.

Let's check our progress: Hadoop is done, so our next step is Hive. Let's run through that real quick. As you might expect, the first step is to download it; Hive's download page was giving me some issues, but you can get to 2.3.5 from there, or go through the Apache archive and find 2.3.5 and download the binaries. Same as before, use wget or download through the browser. I already have mine downloaded, so I'll unpack the apache-hive tarball and rename it. Now we have Hadoop, Kafka, and Hive all in the same place.

Next, let's jump into the .bashrc again. We export HIVE_HOME, which is where we just saved it, /home/<your username>/hive, and then add it to the PATH following the same syntax as above — HIVE_HOME plus the bin directory, because that's where all the commands are. Source the file so the changes take effect, and now we should be able to run hive --version. We get Hive 2.3.5, which is what we downloaded, so that's looking good.

Hive runs on top of the Hadoop file system, so we need to make some directories that Hive can use. A few more Hadoop file system commands: hadoop fs -mkdir with the -p flag for /user/hive/warehouse (the -p flag creates all the intermediate directories for us), and also a /tmp directory, which doesn't really need the -p flag because it's just one directory. Then change the permissions on /user/hive/warehouse and do the same for /tmp; those are needed so Hive can write to them.

Now let's go into Hive's conf directory and change some configuration. There's a template file that explains some of the variables; we're going to make a new file called hive-env.sh and copy the contents from the GitHub readme, changing the names to match our setup. Next we need a hive-site.xml, which doesn't exist yet, so we create it from scratch and paste in the configuration — this one is really big, so I'm glad I don't have to type it. One thing to notice: the warehouse directory we just made in HDFS is the default location for the database, and the metastore URIs setting is what will let us connect to Hive from the Spark file we're going to write later. So we have our hive-site.xml and we should be good to go. One small thing to point out: once we enter Hive we get an annoying warning about multiple SLF4J bindings, because there are duplicate SLF4J jars between Hive and either Kafka or Hadoop.
One way to deal with that is to go into Hive's lib folder, find the jar whose name starts with log4j-slf4j, and rename it; everything still works fine, it just gets rid of that warning. It's not a big deal if you don't.

Next, remember that once we grab tweets from Kafka we're going to save that data to a database in Hive, so we need to initialize Hive's metastore database. We use the schematool: -initSchema with -dbType derby, just like that. Then the last setup step is to run hive --service metastore, which starts the metastore server. It says it started, and it's not going to give us many logs, but that's fine; we need to keep it running, so name that tab and open a new one.

Now that the metastore is running we can enter Hive and do a little query: show databases should just say default (yep), show tables shows nothing, and let's make sure we're using default. Now let's create our table: create table tweets, and remember the schema — we're going to keep the text of the tweet, count how many words there are, and count the length of the whole tweet. Then we give it a row format, delimited fields terminated by something. This is the tricky part, deciding what the fields are terminated by. Often you can just use a space if you know the kind of data you're getting, but tweets have lots of spaces, so that won't work; you could use a tab, but there's a good chance tweets contain tabs too, and I'm not too familiar with the restrictions on tweets. What I decided to do, which might not be the best way, is choose an arbitrary sequence of characters, something like a couple of backslashes and a pipe; there's a chance someone tweets that exact sequence, but I think it'll work fine. Finally we say stored as textfile. If we show tables, there's our tweets table, and a select * returns nothing because we just made it, but as we go along there will be something in there, and that'll be the fun part. Let's quit out of the Hive shell so things don't get too cluttered.

Checking in: we've installed Hive, so our next step is to install Spark. First step, as always: download Apache Spark. We're choosing between the available builds; both are stable, but I'm going to go for the later version and download that, through the browser or the console. I already have mine downloaded, so unpack the Spark tarball and rename it. Similar to how we needed Java before using Kafka, we'll need Scala before using Spark: Spark is written in Scala and requires it on the system to run. Let's check if we have it — we don't — so install it real quick and check scala -version. It also comes with the Scala compiler, although we won't need that since we're going to be writing our code in Python.
That said, you should already have pip3 from when we installed kafka-python, but we will need PySpark, the Python API for Spark, so run pip3 install pyspark. I already have it, and you can check with pip3 list. So we have PySpark, Spark, and Scala.

One thing to remember: inside the spark folder there's a bin directory with all the commands like pyspark and spark-submit (which is probably the other one we'll be using), so let's make sure we add /home/<your username>/spark/bin to the PATH in the .bashrc. Source it and give it a test by starting pyspark. Here's one issue: it says it can't find Python. Depending on your distribution you might not have Python 2, so what we should have done back in the .bashrc is add another line exporting PYSPARK_PYTHON and setting it to python3. (And remember to source the .bashrc afterwards — that's what happens when you don't.) Trying again, it says it tried Python 2 first, then picked up that environment variable, so now it's using Python 3. So here we go: we're in the PySpark shell on Python 3, and the shell already creates a SparkContext and a SparkSession for us, which is something we'll get into very shortly.

One last thing we'd run into if we just continued, but might as well take care of now: there's a certain jar file we need to download separately. If you continue without it you'll get an error telling you where to go, which is search.maven.org. We want the spark-streaming-kafka-0-8 assembly from Apache, the 2.11 build (that's the Scala version), version 2.4.3. Like I said, without this we'd hit an error saying the library isn't included and giving us the link, so we might as well grab it right now. And that's mostly it for installing Spark; unless we forgot something (we can fix any errors we hit later), our environment is now set up and we can move on to writing our producer and our Spark transformer.
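Before we leave the install step, here's a quick sanity check you can type straight into the pyspark shell; this is just my own check, not something from the video, to confirm it's really running Python 3 and can execute a small job:

    # inside the pyspark shell, which already gives us `sc` and `spark`
    import sys
    print(sys.version)                          # should report Python 3.x now that PYSPARK_PYTHON is set
    print(spark.version)                        # the Spark version we just installed
    print(sc.parallelize(range(100)).sum())     # 4950 -- runs a tiny Spark job end to end

If all three lines print without errors, the PySpark install is in good shape.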
All right, we're at our final step: writing the Kafka producer and the Spark transformer. One thing I forgot to mention is that we're collecting a stream of tweets from Twitter, and to use the Twitter API you need a Twitter developer account, which is easy to get but takes a while, so you might not be able to follow along completely. I did come up with an alternative, though, so you can still use Kafka and Spark Streaming without the Twitter API.

Let's jump back to the VM and get started. Remember we have ZooKeeper running, the Kafka broker running, and the Hive metastore running. I was thinking we could write the code together, but that's a little boring, so I already copied the files in from the GitHub repo; these are the files you need for the example. There's a tweet stream, a fake tweet stream I made for people without Twitter API access, and the transformer.

Let's look first at the tweet stream, starting at the top. We import the kafka-python package we installed at the start, plus a config file where my API keys live. One thing I didn't show is that you also need to pip-install the tweepy package. The streamer looks at some topics on Twitter; we're going to watch everything to do with pizza. We have our Kafka broker and the Kafka topic we made earlier, called tweets. The Twitter API docs give a really good example of extending their streamer, and that's what this does: each time we get a tweet matching what we're listening for, I clean it up a little (getting rid of the retweet prefix and any links), the Kafka producer sends the tweet to that topic, encoded as bytes, and then there's a bit of logging output so we know it's working. Below that is where the API authentication is set up, the streamer is created and told what to listen to, and where the Kafka producer itself is made: producer equals KafkaProducer, given the list of Kafka brokers to connect to. Pretty simple.

As for the fake tweet stream: depending on your Linux distribution you might have a word list under /usr/share, so I grab that, connect to the Kafka producer like before with the bootstrap servers, and send a few random words at a time, printing them as they go. Let's change the permissions on both files so they're easier to run. Running the fake tweet stream, it just produces a bunch of nonsense at more or less random intervals; with the real tweet stream I don't print the messages themselves, just "sending tweets", but those are actual tweets coming in, and we'll be able to see them on the other side. So that's it for the producer side: it publishes to the Kafka cluster, and the transformer is going to consume from it and write to Hive.
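As a rough sketch (not the exact file from the repo — the file name, word-list path, and timing are my guesses), the fake tweet stream could look something like this:

    # fake_tweet_stream.py -- a minimal sketch of the "fake tweet" producer described above
    import random
    import time
    from kafka import KafkaProducer

    # many distributions ship a word list here; adjust the path if yours differs
    with open('/usr/share/dict/words') as f:
        words = [w.strip() for w in f if w.strip().isalpha()]

    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    while True:
        # glue a few random words together to fake a tweet
        fake_tweet = ' '.join(random.choices(words, k=random.randint(3, 15)))
        producer.send('tweets', fake_tweet.encode('utf-8'))
        print('sent:', fake_tweet)
        time.sleep(random.uniform(0.5, 3.0))    # random-ish interval, like in the video

The real tweet stream is the same idea, except the words come from tweepy's streaming API instead of a word list, and each tweet gets cleaned up before being sent.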
Let's look at the transformer next. In this one we import SparkContext, SparkSession, StreamingContext, and KafkaUtils; KafkaUtils is why we had to download that jar earlier, so let's check we still have it — yes, this jar right here — and before we forget, move the spark-streaming jar out of Downloads into the home directory.

Back in the code, starting near the top: we create a SparkContext and use it to create a StreamingContext. The number we pass in is the interval between the times it runs each batch; this isn't exactly real time. For example, if we set it to 60, every minute it would gather all the new data that came in on the stream and then run the processing on it. We're using 5, which is pretty low, but it makes things a little more interesting to watch. Then we create a SparkSession, and this config should look familiar: it points at the metastore URI of the metastore server we have running. Here's where we connect to Kafka — and good thing we're reading through this, because the topic should be tweets, not messages — and the broker list is our Kafka broker on port 9092. Next we transform each line: for each tweet we keep just the text itself, a count of the words, and a count of the total length. And right here is where the real processing happens: this transform variable is a direct stream (a DStream), which is built out of a bunch of RDDs, an object similar to a DataFrame, and we say for each of those RDDs, go handle it with the method defined up above. That method gets passed the RDD, and if it isn't empty we take the SparkSession and create a DataFrame from the RDD, giving it the schema text, count of words, length. We show it to the console so we can monitor it a little, and then the DataFrame's write saves it as a table to default.tweets, which is the table we have in Hive, formatted as Hive, in append mode because we don't want to erase the table each time. At the bottom we take ssc, our streaming context, call start, and awaitTermination, which just says keep going until we tell it to stop.
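Putting those pieces together, a minimal version of the transformer might look roughly like this. The column names, the metastore port, and some details are assumptions on my part, so treat it as a sketch rather than the repo's exact file:

    # transformer.py -- sketch of the Spark Streaming job described above
    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils   # needs the spark-streaming-kafka-0-8 assembly jar

    sc = SparkContext(appName="tweet-transformer")
    ssc = StreamingContext(sc, 5)                     # 5-second micro-batches

    spark = SparkSession.builder \
        .config("hive.metastore.uris", "thrift://localhost:9083") \
        .enableHiveSupport() \
        .getOrCreate()                                # 9083 is the default metastore port (assumption)

    def handle_rdd(rdd):
        # each batch arrives as an RDD of (text, word_count, char_count) tuples
        if not rdd.isEmpty():
            df = spark.createDataFrame(rdd, schema=['text', 'words', 'length'])
            df.show()                                             # peek at the batch in the console
            df.write.saveAsTable(name='default.tweets',
                                 format='hive', mode='append')    # append to the Hive table we created

    # direct stream from the "tweets" topic on our single broker
    stream = KafkaUtils.createDirectStream(
        ssc, topics=['tweets'],
        kafkaParams={"metadata.broker.list": "localhost:9092"})

    # Kafka messages arrive as (key, value) pairs; keep the text and compute the counts
    transform = stream.map(lambda msg: (msg[1], len(msg[1].split()), len(msg[1])))
    transform.foreachRDD(handle_rdd)

    ssc.start()
    ssc.awaitTermination()

You run it with spark-submit and the --jars flag pointing at that Kafka assembly jar, which is exactly what we'll do next.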
I know we went through that quickly, but let's see it in action. Double-check that ZooKeeper is running, the broker is running, and the metastore is running. In this shell over here I'll go into Hive again and rename the tab hive; this one will be the producer and this one the transformer. In Hive: use default, show tables, and let's get our select from tweets ready so we can reuse it — nothing there yet. Let's start producing by running the tweet stream, so it's sending tweets to the Kafka topic, and then start the transformer with spark-submit, adding that extra jar with --jars, followed by our actual Python file.

One second — that stopped with an "unexpected character" error. I probably had an issue pasting everything into vim and let a line overflow, so let me clean that out real quick; it's not as clean looking, but it should fix it. Now the transformer is running, but it's going to be looking for data that isn't there because we paused the producer — I guess we should have checked that first — so let's start the tweet stream up again. It hands the tweets over, and the transformer waits out the five-second interval before gathering them all. And here they come: this is someone talking about pizza, probably; we get the text, the number of words, and the total length, and you can mostly ignore these errors. Here are some more; we can't really read them very well because Spark tries to format them nicely for us, but they're coming through, and the Spark transformation is being run on each of them.

Now let's check whether we're getting them in Hive. It looks like we are — obviously the formatting is rough, but what can you expect. Let's do a count on it; that runs a little MapReduce job, so it takes a moment, but we already have 144 tweets in, and they keep coming as the transformer keeps processing — this person used five words, 27 characters. Run it again and now we're at 216; it even picks up the emojis, which is fun. So I'll stop the producer and the transformer and run the count one last time to see how much we got in total: 244.

Hopefully that gives you a good working example. Obviously this was pretty simple — we were just counting the characters and words of each tweet, which isn't very interesting — but it does show that you can have a Kafka cluster, make a producer that creates a stream, and, as that producer keeps producing, have an application that deals with it. The stream processor can do anything; ours just counted and saved, but you could do a big transformation and save the results to your Hive table, and that Hive table can then be used for many other purposes. Hopefully this gives you a better working picture of how these pieces fit together. I'm sure I missed something or explained something wrong, so please forgive me for that, and thanks for watching.
Info
Channel: Davis Busteed
Views: 10,589
Keywords: Kafka, Spark, Spark Streaming, Hive, Hadoop, Apache
Id: 9D7-BZnPiTY
Length: 57min 33sec (3453 seconds)
Published: Mon Jul 15 2019