Install Apache PySpark on Windows PC | Apache Spark Installation Guide

Captions
In this video, I'll show you how to install Apache Spark on a Windows PC. For Mac users, I'm sorry, I don't have a Mac, but I'll give you a link to a step-by-step procedure for installing Apache Spark on Mac as well. So without further ado, let's get into it.

Let's get started with our Spark installation. There are a couple of prerequisites you need to cover first: you should have the Java Development Kit (JDK), version 8 or above, on your system, and you should have a recent version of Python. I'll put the download links for all of these tools in the description below so it will be easier for you.

First, go to your favorite browser and search for "download JDK". Once you're there, go to Downloads; you can download any version, but I'll be choosing the latest, which is Java 19 at the time of shooting this video. Go to the Windows tab and download the installer (.exe) file. Once it has downloaded, open it and click Next. I highly recommend extracting Java into a specific directory: click Change, go to the C drive, and create a new folder named Java; inside Java, create another folder named jdk (for Java Development Kit). Select that as the install directory, click Next, and that's it, your Java installation is successful. Close the installer and we can move on to Python.

To install Python, open a new tab and search for "python download". Go to the python.org website and select the newest version, since it will not affect your Spark installation; I'll download Python 3.11.1. Once it has downloaded, open it and make sure to tick "Add python.exe to PATH", which makes life much easier because you don't have to set the environment variable yourself. Click Install Now, and your setup is successful. With that, both prerequisites are covered.

Now we can go ahead and install Spark. Open a new tab and search for "download Spark", which takes you to the spark.apache.org website. The versions here are very important: you have to select an Apache Hadoop version that is compatible with your Spark and Scala versions, because you also have to install the winutils.exe dependency, which is built for a specific Hadoop version. So choose the Hadoop 2.7 package type and 3.3.1 as the Spark release. 3.3.1 was the latest at the time of this video; it may be different when you install, but make sure you choose the Hadoop 2.7 package. Click "Download Spark", which gives you a tar archive; click the mirror link to start the download. It's not that huge a file, just 261 MB, so wait for it to complete.
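Before moving on to the Spark setup itself, it can help to confirm that both prerequisites are reachable from the command line. Here is a minimal Python sketch for that check; it assumes java and python landed on your PATH during the installs above:

```python
import subprocess
import sys

# Version of the Python interpreter running this script
print("Python:", sys.version.split()[0])

# Ask the JDK for its version; note that `java -version` prints to stderr
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print("Java:", (result.stderr or result.stdout).strip().splitlines()[0])
```

If the Java line errors out, the installer didn't put java on your PATH, which is not fatal: the JAVA_HOME variable we set later is what Spark's launch scripts look at first.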
Once the download is complete, navigate to your Downloads folder and select the file. Then go to the C drive and create a new folder named spark (lowercase is fine), and paste the file in there. Now you have to extract it: since it's a tar archive, you need to extract it to get access to Spark, so use Extract Here. That's it, your tar file is extracted. Inside you can see the bin directory, where all the pyspark and spark-shell command scripts are present, and from here you could directly kick off the Spark shell. But first you have to download winutils, and it must match the Hadoop version you downloaded, which is 2.7.

To do that, I'll give you a link in the description to a GitHub repository where these files are hosted for each specific Hadoop version, so you won't face any issues. Go into the hadoop-2.7.1 folder, open its bin directory, and at the bottom you will find winutils.exe; download it. Once the download is complete, go to the file location, grab the winutils file, come back to the C drive, and create a new folder named hadoop, because that is a dependency for kicking off Spark on your PC. Inside hadoop create a bin directory, and paste the file in there (if the filename got mangled during the download, rename it back to winutils.exe).

So now we have Hadoop set up and also Apache Spark in the spark folder. You should have: a Java folder containing jdk, where Java is installed; a hadoop folder containing bin and winutils.exe; and, last but not least, a spark folder containing the tar file and the extracted bin directory with all the commands. To verify quickly, open a Command Prompt. First check the Java version with java -version; you can see we have the version we installed. Then run python --version and confirm it prints the version you just installed.

Now you have to set the paths for Spark and Hadoop, and that is done in the environment variable settings. Go to the Windows search, look for "environment variables", open "Edit the system environment variables", and click the Environment Variables button. This is where you create your environment variables, so follow along very carefully: any small mistake will give you an error, and those small mistakes can waste a lot of time (they did for me). To simplify your life, I'll put all of these paths in the description below so you can copy and paste them and avoid any confusion.
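Before you start adding variables, it's worth sanity-checking that everything sits where the variables will point. A quick sketch, using the illustrative paths from this walkthrough (adjust the names if you chose different folders):

```python
from pathlib import Path

# Paths used in this walkthrough; yours may differ
checks = {
    "JDK":      Path(r"C:\Java\jdk\bin\java.exe"),
    "winutils": Path(r"C:\hadoop\bin\winutils.exe"),
    "Spark":    Path(r"C:\spark"),  # the extracted Spark folder lives in here
}

for name, path in checks.items():
    print(f"{name:8} {path} -> {'OK' if path.exists() else 'MISSING'}")
```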
To add the first variable, click New. The first one you will create is JAVA_HOME, which lives on the C drive. To avoid mistakes, use Browse Directory: navigate to the C drive, find the Java folder, select the jdk folder inside it, and click OK; you'll see C:\Java\jdk. Click OK, and that's the first step done.

Next, add HADOOP_HOME, since you've just set up that winutils file: click New, name it HADOOP_HOME, browse to C:\hadoop, and click OK on both dialogs. You have now set up JAVA_HOME and HADOOP_HOME; next is SPARK_HOME. Click New again, name it SPARK_HOME, use Browse Directory, navigate to the C drive where you extracted the tar file, go into spark, select the extracted folder, and click OK.

To be on the safe side, also add the PySpark variable and point it at the python.exe file. Click New, name it PYSPARK_HOME, and browse for Python. Since we didn't choose a specific directory during the Python install, it lives under your user profile: go to C:\Users, open your username, then AppData, Local, Programs, Python, and select the Python311 folder, because that is the version we just installed. Click OK, then append \python.exe after Python311, because the variable has to point at the Python executable itself; that step is mandatory. Click OK, and you now have HADOOP_HOME, JAVA_HOME, PYSPARK_HOME, and SPARK_HOME set up. In summary:

JAVA_HOME = C:\Java\jdk
HADOOP_HOME = C:\hadoop
SPARK_HOME = C:\spark\<the extracted Spark folder>
PYSPARK_HOME = C:\Users\<username>\AppData\Local\Programs\Python\Python311\python.exe

You also have to add these to the Path variable. Select Path in the system variables and click Edit; you can see the different paths here, and you have to add the bin directories for Java, Spark, and Hadoop. JAVA_HOME is already set up to its bin directory, so follow the same pattern for the other two: click New and add %HADOOP_HOME%\bin (the winutils file is in there), then click New again and add %SPARK_HOME%\bin. Click OK on each dialog and you're good to go. That's everything you have to do to install Apache Spark.

To verify your installation, open a Command Prompt. We've already checked Java and Python, so now kick off the Spark shell by running spark-shell. Hit Enter; it will take some time because it's the first boot, and there you go, you have Spark running on your Windows PC and the Scala shell open. You can see the version we downloaded, 3.3.1, and the Scala version, 2.12.

That's all it takes to install Spark on your PC, but that's not enough: we're not going to use the Spark shell to write our applications, so we should have an IDE installed, which we'll cover in the next video. Also, this is a Scala shell, and we're going to use Python for the full course, so let's verify that PySpark runs fine too; Python is already installed, so PySpark should also run fine. Type :quit to leave the Scala shell, then type pyspark and hit Enter, and there you go, PySpark is also running fine. To quickly verify it, print the Spark version with spark.version, and you get 3.3.1, which means the PySpark shell is also working. And to visualize your Spark application, you can open the Spark web UI from the link printed when the shell starts.
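If you prefer verifying from a script instead of the interactive shells, a minimal sketch like the one below does the same check. It uses the spark-submit launcher from the bin directory we just put on the Path; the file name check_spark.py is just an illustration:

```python
# check_spark.py  --  run with:  spark-submit check_spark.py
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession
spark = SparkSession.builder.appName("install-check").getOrCreate()

# Same check as typing `spark.version` in the pyspark shell
print("Spark version:", spark.version)

# A tiny job to prove tasks actually execute
print("Row count:", spark.range(5).count())

spark.stop()
```

If this prints 3.3.1 (or whichever release you installed) and a row count of 5, then the JAVA_HOME, HADOOP_HOME, and SPARK_HOME wiring is correct.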
Open it in your browser and you can monitor your Apache Spark application. This is the home page, with the event timeline; once you submit a job, it will show its status (succeeded, failed, or still running) on your cluster, and there are different sorts of things you can monitor here. You can go to Environment and see the Spark properties we've already set up (we haven't messed with them), and you can also see the Java home, which is again on our C drive. You can also find the stages for all the jobs; since we haven't run any jobs yet, there's no data here. Storage is empty too, and you can go to Executors as well. So you get a very holistic view of your application: you can monitor your jobs and see how Spark runs them under the hood, processing your data in parallel.

That was all about setting up Apache Spark on your system. In the next lecture, let's set up an IDE: we'll install the Anaconda distribution, which already comes with everything you need, so you can use either Jupyter notebooks or the Spyder IDE; we're going to explore both options, so don't worry, whatever works for you is fine. If you already have PyCharm running, you can set up your Spark application in PyCharm as well. I'll put all of these links in the description below, along with the environment variables you need to set to be able to kick off Spark. If you face any difficulties, let me know in the comments and I'll get back to you as soon as possible. I hope you liked this lecture; please subscribe to the channel, ring the notification bell to get the latest updates, and don't forget to follow us on our social media, which I've linked in the description below. Thanks for watching!
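One gotcha when exploring the web UI: it lives at http://localhost:4040 only while an application is running, so the shell's UI vanishes the moment you quit. Here is a small sketch (file name and sleep length are arbitrary) that runs a visibly parallel job and then pauses, so the Jobs, Stages, and Executors tabs have something to show:

```python
# ui_demo.py  --  run with:  spark-submit ui_demo.py
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ui-demo").getOrCreate()

# Spread a million numbers across 8 partitions so several tasks
# show up under the Jobs and Stages tabs
total = spark.sparkContext.parallelize(range(1_000_000), 8).sum()
print("Sum:", total)

# Keep the application (and its UI at http://localhost:4040) alive
# for two minutes while you click around
time.sleep(120)
spark.stop()
```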
Info
Channel: AmpCode
Views: 99,314
Keywords: apache spark, big data, pyspark installation on windows pc, pyspark installation on windows, pyspark, pyspark installation, apache spark installation, apache spark installation windows, how to install apache spark on windows 10, spark installation windows, data engineering, data engineering skills, pyspark tutorial, pyspark tutorial for beginners, pyspark tutorial ampcode, data science, data engineer, how to become a data engineer, apache spark tutorial, what is apache spark
Id: OmcSTQVkrvo
Length: 14min 41sec (881 seconds)
Published: Tue Feb 07 2023