Apache Spark is an open-source, general-purpose cluster computing engine designed for fast, large-scale data processing. It also ships with a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. In this article you will learn how to set up a PySpark development environment on Windows. Read along to learn how to install Spark on your Windows laptop or desktop.
We will be installing Spark 2.3.2, the latest version as of this writing, released in September 2018.
Make sure you have Java 8 installed on your PC before proceeding. If you aren't sure, open the Command Prompt and run the following command:
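java -version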
After running the command, you should see output similar to the sample below (your version and build numbers may differ). If you get an error instead, install Java first and set the appropriate environment variables.
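java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)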
To install Java and set up the environment variables, do the following.
- Download and install the Java 8 JDK from the Oracle website.
- Set environment variables
- Variable: JAVA_HOME
- Value: C:\Program Files\Java\jdk1.8.0_91
- Variable: PATH
- Add Value: C:\Program Files\Java\jdk1.8.0_91\bin
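You can set these through Control Panel > System > Advanced system settings > Environment Variables. As an alternative, here is a sketch using setx from the Command Prompt (assuming the JDK path above matches your install; setx changes only take effect in newly opened terminals, and setx truncates PATH values longer than 1024 characters, so prefer the dialog if your PATH is long):
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_91"
setx PATH "%PATH%;C:\Program Files\Java\jdk1.8.0_91\bin"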
Download Spark Binaries
With the Java prerequisite in place, head over to the Spark downloads page and download the following file: spark-2.3.2-bin-hadoop2.7.tgz
Once you have downloaded the file, extract it to a location of your choice. We will then point environment variables at that location.
I chose this location on my PC: C:\Spark\spark-2.3.2-bin-hadoop2.7 and will be using it in the configurations explained later in the article.
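If you don't have an extraction tool installed, recent Windows 10 builds ship with a built-in tar command that can unpack the archive. A quick sketch, assuming the file sits in your Downloads folder:
mkdir C:\Spark
tar -xzf %USERPROFILE%\Downloads\spark-2.3.2-bin-hadoop2.7.tgz -C C:\Spark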
Spark on Windows needs winutils.exe to work properly. As the next step, download winutils.exe built for Hadoop 2.7 (it is commonly distributed through the steveloughran/winutils repository on GitHub).
Save the file under the bin folder of Spark. In my case, the full path where I saved the file is: C:\Spark\spark-2.3.2-bin-hadoop2.7\bin
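To confirm the file landed in the right place (using my path from above; adjust to yours):
dir C:\Spark\spark-2.3.2-bin-hadoop2.7\bin\winutils.exe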
Environment Variables for Spark
With all the Spark files and prerequisites in place, it's now time to set some important environment variables for Spark.
- Variable Name: SPARK_HOME
- Value: C:\Spark\spark-2.3.2-bin-hadoop2.7
- Variable Name: PATH
- Add Value: %SPARK_HOME%\bin
- Variable: HADOOP_HOME
- Value: %SPARK_HOME%
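Note that HADOOP_HOME points at the Spark folder because that is where we saved winutils.exe (Spark looks for it under %HADOOP_HOME%\bin). As with the Java variables, you can set these via the Environment Variables dialog, or as a sketch with setx (literal paths are used here because values set by setx in the same session won't expand each other):
setx SPARK_HOME "C:\Spark\spark-2.3.2-bin-hadoop2.7"
setx HADOOP_HOME "C:\Spark\spark-2.3.2-bin-hadoop2.7"
setx PATH "%PATH%;C:\Spark\spark-2.3.2-bin-hadoop2.7\bin"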
To test that Spark is set up correctly, open the Command Prompt and cd into the Spark bin folder: C:\Spark\spark-2.3.2-bin-hadoop2.7\bin
Next, run the following command:
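spark-shell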
The Spark shell will open with a Spark context available as 'sc' and a Spark session available as 'spark'. You have now set up Spark!
With Spark installed, we will now create an environment for running and developing PySpark applications on your Windows laptop. On my PC, I am using the Anaconda Python distribution.
As a first step, we will create a new virtual environment for Spark. The environment will have Python 3.6, and we will install PySpark 2.3.2 into it, matching the version of Spark we just installed.
Run the command:
conda create -n spark python=3.6
Next, activate the environment using:
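conda activate spark
(On conda versions older than 4.4, the command is activate spark instead.)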
Lastly, install pyspark 2.3.2 using pip by running the command:
pip install pyspark==2.3.2
With the above steps completed, you have successfully set up a Spark environment on Windows for development purposes. The next step is to start coding! In a later article, I'll show you how to develop a simple application using PySpark and the environment we just set up.
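As a quick smoke test of the new environment, here is a minimal sketch (the file name smoke_test.py and the sample data are my own, not from any particular application):

```python
# smoke_test.py -- minimal end-to-end check of the PySpark setup
from pyspark.sql import SparkSession

# Start a local Spark session using all available cores
spark = (SparkSession.builder
         .master("local[*]")
         .appName("smoke-test")
         .getOrCreate())

# Build a tiny DataFrame and print it; seeing the two-row table
# means Spark, winutils, and PySpark are all wired up correctly
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

spark.stop()
```

Run it with python smoke_test.py from inside the activated conda environment.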