Google Cloud Tutorial - Hadoop | Spark Multinode Cluster | DataProc

Video Statistics and Information

Captions
Hello and welcome to the Google Cloud tutorial at Learning Journal. If you are learning Hadoop and Apache Spark, you will need some infrastructure. You can do a lot on your local machine or try things in a VM, but sooner or later you will run into the following problems. You have limited resources, so all you can do is create a single-node or standalone environment, while you really want to learn on a real multi-node cluster. And installation and setup of Hadoop and Spark is a tedious and time-consuming process. As an alternative, you can download a ready-to-use VM image, but those don't offer you a multi-node cluster. In this video, I will set up a six-node Hadoop and Spark cluster, and I will do it in less than two minutes. Most importantly, you don't have to buy expensive machines, download large files, or pay anything to get this cluster. So, let's get started.

In the very first video of this tutorial, we created a free account in Google Cloud and also created a VM in GCP. You can use your free GCP account to set up a six-node Hadoop cluster. Let's do that. Google offers a managed Spark and Hadoop service called Google Cloud Dataproc. It is very similar to Amazon's EMR. You can use the Dataproc service to create a Hadoop and Spark cluster in less than two minutes. Let me show you the step-by-step process. I am assuming that you have already set up your Google Cloud account and a default project as explained in the first video.

Now, start your Google Cloud Console. Go to the products and services menu, scroll down to Dataproc, and select Clusters. Hit the Create Cluster button. Give a name to your cluster and choose your region. The default selection is global, but we don't want our cluster to spread across the globe; I want to keep all my nodes in a single region and a single zone, which keeps the cost low. Choose a machine type for your master node. With a free GCP account, you cannot have more than eight concurrent CPU cores. A paid account doesn't have such limits, but eight CPU cores are good enough for us. I will give two CPU cores to the master node. The next option is the cluster type. We have three choices: a single-node cluster, a standard multi-node cluster with no high availability, and a high-availability multi-node cluster, which creates three master nodes. We will go with a standard multi-node cluster. Let's reduce the disk size to 32 GB; that's enough for me. Now, let's choose the worker node configuration. I will pick a single CPU for each worker, take five workers, and reduce the disk size to 32 GB for each worker as well. Great. This configuration gives me a six-node Hadoop cluster: one master and five workers. You can hit the Create button, and Dataproc will launch the six-node cluster in less than two minutes.

The new cluster comes with Hadoop and Spark, but I want to configure a Jupyter notebook as well, so let me expand this part. We keep all the defaults. Do you see this option, the image version? The latest image version as of today is 1.2. What does that mean? Dataproc 1.2 comes with Spark 2.2 and Hadoop 2.8, the most recent versions of Spark and Hadoop. The Google Dataproc team is quick to make the latest versions available for your exploration. If you need an older version of Spark, you can choose an older Dataproc image version.
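By the way, if you prefer the command line to the Console, the same configuration can also be expressed with the Cloud SDK. The following is only a sketch based on the settings above; the cluster name, region, zone, and machine types are assumptions taken from this walkthrough, so adjust them for your own account:

gcloud dataproc clusters create spark-6 \
    --region=us-east1 \
    --zone=us-east1-c \
    --image-version=1.2 \
    --master-machine-type=n1-standard-2 \
    --master-boot-disk-size=32GB \
    --num-workers=5 \
    --worker-machine-type=n1-standard-1 \
    --worker-boot-disk-size=32GB

An initialization action, which we discuss next, can be added to this command with the --initialization-actions flag.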
OK, the initialization actions. An initialization action lets you install additional components on the cluster; you need to supply a script for that. You can create your own script, but the good thing is that the Dataproc team has already prepared many of them. Here is the list: you can get Apache Drill, Flink, Jupyter, Kafka, Presto, Zeppelin, and much more. We will use jupyter.sh. Good. Let's hit the Create button. Wait for a minute or two, and the Dataproc API will provision your cluster. Amazing, isn't it? No downloads, no installation, nothing. Your cluster is ready in just one click. We used the Cloud Console UI to create this cluster, but you can use the Cloud SDK to create a script and launch the cluster by executing a script file from your local machine. You can also achieve the same result using REST-based APIs. The SDK and the APIs are the primary tools for automating things in Google Cloud; I will cover them some other day.

Now, the next part: how to access and use this cluster. I want to do at least two things: SSH to one of the nodes in the cluster, and access the web-based UIs, for example the Resource Manager UI and my Jupyter notebooks. The first part is simple. Click on the cluster, and you should be able to see the list of VMs. This one is the hostname of my master, and these are the worker nodes. You can SSH to the master node. In fact, if you check your dashboard, you will see your VMs, and you can SSH to any of them. You may not want to do that, but GCP doesn't stop you from accessing your VMs. Let's SSH to the master node. These VMs run Debian, so your CentOS or Red Hat specific commands may not work, but that shouldn't be a problem, right? You can access HDFS, you can start the Spark shell, and you can submit a Spark job too. Check out my Spark tutorials for more on Spark; I will be using this cluster in many of my Spark tutorial videos.

Now the next part: how to access the web-based UIs. All such UIs are available on different ports. For example, the Resource Manager is available on port 8088 and Jupyter on port 8123. I have the option to create a firewall rule and open these ports, but I don't want to do that, because these services are not secured and I don't want to open several ports to attackers. There is an easier alternative: an SSH tunnel. It's a two-step process. We create an SSH tunnel to the master node, and then configure the browser to use a SOCKS proxy. The proxy routes the data from your browser through the SSH tunnel. Let's do it.

The first thing is to install the Google Cloud SDK. The link to download the SDK is available in the documentation. Start the installer and follow the on-screen instructions. The installer automatically starts a terminal window and runs an init command, which asks for your GCP account credentials and the default project name. Once you have the SDK, you should be able to use the gcloud and gsutil command-line tools.

OK, we wanted to create an SSH tunnel. Start a terminal and use this command. The gcloud compute ssh command will open a tunnel from port 10000 on your local machine to the node spark-6-m in the GCP zone us-east1-c. You can change the zone name and the master node name based on your cluster setup. The -D flag enables dynamic port forwarding, and -N instructs gcloud not to open a remote shell. Great, the tunnel is open. Minimize these windows. The next step is to start a new browser session that uses the SOCKS proxy through this tunnel. Start a new terminal and launch your browser using this command. I am starting chrome.exe with a URL; this URL is for my YARN Resource Manager. The next option is the proxy server: it should use the socks5 protocol on my local machine's port 10000, the port where we started the SSH tunnel. The next flag is to avoid any DNS resolution by Chrome. Finally, the last option is a non-existent directory path, which makes Chrome start a brand-new session.
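Putting those two steps together, the commands look roughly like this. This is a sketch based on the cluster in this walkthrough (master spark-6-m in zone us-east1-c); the local port, the Chrome invocation, and the user-data directory are assumptions you should adapt to your own machine.

Step 1, the SSH tunnel with dynamic port forwarding on local port 10000:

gcloud compute ssh spark-6-m --zone=us-east1-c -- -D 10000 -N

Step 2, a fresh browser session that routes all traffic through the SOCKS proxy:

chrome.exe http://spark-6-m:8088 --proxy-server="socks5://localhost:10000" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" --user-data-dir=/tmp/spark-6-m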
Great. You can access the Resource Manager, and it shows that one application is already running on this cluster. You can access your Jupyter notebook by changing the port number to 8123 and create a new notebook. It doesn't offer Scala notebooks because we haven't configured Apache Toree or another equivalent kernel, but I think that's fine; you can start PySpark at least.

You have the cluster, and you can access it over the web and over SSH. Execute your jobs, play with it, and later go back to your Dataproc clusters list and delete it. We don't have an option to keep it there in a shut-down state. Creating and removing a cluster is as simple as a few clicks. You can create a new one every day, use it, and then throw it away.

You might be wondering about your data: if I delete a cluster, what happens to my data? Well, if you stored it in HDFS, you lose it. Keeping data outside HDFS allows you to create a cluster as and when you need it and delete it when you are done with the compute workload. That's a fundamental architectural difference between elastic clusters and persistent clusters. Even if you are using Amazon's EMR, you don't keep data in HDFS; you keep it in S3 or somewhere else. In the case of Google Dataproc, you have many options. If you have a data file, you will keep it in a GCS bucket, a Google Cloud Storage bucket. Let's leave GCS buckets for another video. However, I will be using GCS buckets as well as HDFS in my Spark tutorials, so tune in to my Spark tutorials.

Great. We learned a lot in this video. Now you have a multi-node cluster at your disposal. Thank you very much for watching Learning Journal. Keep learning and keep growing.
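As a final illustration of the keep-your-data-outside-HDFS workflow described above, here is a minimal sketch. The bucket and file names are hypothetical placeholders, and the cluster name and region are the ones assumed in this walkthrough.

Copy your data to a Cloud Storage bucket so it outlives the cluster:

gsutil mb gs://my-learning-journal-bucket
gsutil cp mydata.csv gs://my-learning-journal-bucket/

Delete the cluster when the compute workload is done; the data in the bucket stays:

gcloud dataproc clusters delete spark-6 --region=us-east1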
Info
Channel: Learning Journal
Views: 74,877
Rating: 4.9229989 out of 5
Keywords: Google Cloud Tutorial, Hadoop Multinode Cluster, Dataproc, Spark Cluster
Id: 6DD-vBdJJxk
Length: 13min 4sec (784 seconds)
Published: Wed Nov 08 2017