Install Spark on Windows Laptop for Development

Apache Spark is an open-source, general-purpose cluster computing engine built for fast data processing. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. In this article you will learn how to set up a PySpark development environment on a Windows laptop or desktop.

We will be installing Spark 2.3.2, the latest version as of this writing, released in September 2018.

Java Prerequisite

Make sure you have Java 8 installed on your PC before proceeding. If you aren't sure, open a command prompt and run the following command.

java -version 

After running the above, the output should report a 1.8 version, with a first line that looks something like java version "1.8.0_91". If it does not, install Java first and set the appropriate environment variables.
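The same check can be scripted. The following is a minimal Python sketch, not part of the original setup: the function names are my own, and it assumes the usual output format in which the first line contains the version in double quotes. Note that java -version writes to stderr rather than stdout.

```python
import re
import subprocess


def parse_java_version(output):
    """Return the quoted version from `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', output)
    return match.group(1) if match else None


def is_java_8(version):
    """Java 8 reports itself as 1.8.x."""
    return version is not None and version.startswith("1.8.")


def installed_java_version():
    """Run `java -version` and parse what it prints (to stderr)."""
    try:
        result = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return None  # java is not on PATH
    return parse_java_version(result.stderr)
```

Calling is_java_8(installed_java_version()) returns False when Java is missing from PATH or when a different major version is installed.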

To install Java and set up the environment variables, do the following.

  • Download and install Java 8 from this link.
  • Set the environment variables (adjust the paths to match the JDK version you installed):
    • Variable: JAVA_HOME
    • Value: C:\Program Files\Java\jdk1.8.0_91
    • Variable: PATH
    • Add Value: C:\Program Files\Java\jdk1.8.0_91\bin

Download Spark Binaries

With the Java prerequisite installed, proceed with the following. Head over to the Spark downloads page and download the following file: spark-2.3.2-bin-hadoop2.7.tgz

Once you have downloaded the file, extract it to a location of your choice. We will then point environment variables to that location.

I chose this location on my PC: C:\Spark\spark-2.3.2-bin-hadoop2.7, and I will use it in the configuration steps later in the article.
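If you do not have a tool at hand that opens .tgz files, Python's standard tarfile module can do the extraction. This is a small sketch rather than a required step; the function name extract_archive and the download path in the usage note are placeholders of mine. Only unpack archives you obtained from a source you trust.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Unpack a gzip-compressed tar archive into dest_dir.

    Returns the sorted top-level names found in the archive, which for the
    Spark download should be a single folder.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
        # First path component of every member in the archive.
        top_level = sorted({m.name.split("/")[0] for m in archive.getmembers()})
    return top_level
```

For example, extract_archive(r"C:\Users\me\Downloads\spark-2.3.2-bin-hadoop2.7.tgz", r"C:\Spark") would unpack the archive under C:\Spark, assuming that is where the download was saved.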

Download Winutils

Spark on Windows needs winutils.exe to work properly. As the next step, follow this link and download winutils.exe.

Save the file under the bin folder of Spark. In my case, the full path where I saved the file is: C:\Spark\spark-2.3.2-bin-hadoop2.7\bin


Environment Variables for Spark

With all the Spark files and prerequisites in place, it's now time to set some important environment variables for Spark.

  • Variable: SPARK_HOME
  • Value: C:\Spark\spark-2.3.2-bin-hadoop2.7

  • Variable: PATH
  • Add Value: %SPARK_HOME%\bin

  • Variable: HADOOP_HOME
  • Value: %SPARK_HOME%
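HADOOP_HOME points at the Spark folder because Hadoop's Windows support looks for winutils.exe under %HADOOP_HOME%\bin, which is where we saved it. The settings above can be verified with a short Python sketch like the one below. The function name check_spark_env is my own, and it takes the environment as a mapping (with an optional file-existence check passed in) so it can be tried without touching the real settings.

```python
import ntpath
import os


def check_spark_env(env, exists=os.path.exists):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    hadoop_home = env.get("HADOOP_HOME")

    if not spark_home:
        problems.append("SPARK_HOME is not set")

    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    else:
        # winutils.exe is expected under HADOOP_HOME\bin.
        winutils = ntpath.join(hadoop_home, "bin", "winutils.exe")
        if not exists(winutils):
            problems.append("missing " + winutils)

    if spark_home:
        # Compare case-insensitively, as Windows paths are.
        spark_bin = ntpath.normcase(ntpath.join(spark_home, "bin"))
        entries = [
            ntpath.normcase(entry.rstrip("\\"))
            for entry in env.get("PATH", "").split(";")
        ]
        if spark_bin not in entries:
            problems.append("SPARK_HOME\\bin is not on PATH")

    return problems
```

Running check_spark_env(os.environ) in a newly opened command prompt returns an empty list when everything is in place. Open a new prompt after changing the variables, since already-open windows keep the old values.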

 

To test that Spark is set up correctly, open the command prompt and cd into the Spark bin folder: C:\Spark\spark-2.3.2-bin-hadoop2.7\bin

Next, run the following command:

spark-shell

You should see the Spark shell open with a Spark context and session available. You have now set up Spark!

Install PySpark

With Spark installed, we will now create an environment for running and developing PySpark applications on your Windows laptop. On my PC, I am using the Anaconda Python distribution.

As the first step, we will create a new virtual environment for Spark. The environment will have Python 3.6, and we will install PySpark 2.3.2 into it, which matches the version of Spark we just installed.

Run the command:

conda create -n spark python=3.6

Next, activate the environment using: 

activate spark

Lastly, install PySpark 2.3.2 using pip by running the command:

pip install pyspark==2.3.2
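To confirm that the new environment works, you can run a tiny job from Python. The sketch below is one way to do it, with function names of my own choosing: it starts a local Spark session, checks that the session reports the same release as the pyspark package, and counts the rows of a five-element range.

```python
def same_release(pyspark_version, spark_version):
    """True when both version strings name the same Spark release."""
    return pyspark_version.strip() == spark_version.strip()


def smoke_test():
    """Start a local session, run a trivial job, and return its row count."""
    # Imported here so this file can be loaded without pyspark installed.
    import pyspark
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("smoke-test")
        .getOrCreate()
    )
    try:
        if not same_release(pyspark.__version__, spark.version):
            raise RuntimeError(
                "pyspark %s does not match Spark %s"
                % (pyspark.__version__, spark.version)
            )
        return spark.range(5).count()
    finally:
        spark.stop()
```

With the spark environment activated, calling smoke_test() should return 5. A RuntimeError would mean the pip package and the Spark installation are on different releases.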

Conclusion

With the above steps completed, you have successfully set up a Spark environment on Windows for development purposes. The next step is to start coding! In a later article, I'll show you how to develop a simple application using PySpark and the environment we just set up.

 

MJ

Advanced analytics professional currently practicing in the healthcare sector. Passionate about Machine Learning, Operations Research and Programming. Enjoys the outdoors and extreme sports.
