Apache Spark Installation Guide on Ubuntu 16.04 LTS

Apache Spark is an open-source, general-purpose cluster computing engine designed to be lightning fast. Originally developed at the University of California, Berkeley, its codebase was donated to the Apache Software Foundation, and today it is one of the largest open source projects in data processing. Did I mention it is extremely fast? Yes it is, and there is plenty of hype around it. Getting started with it, however, is not entirely straightforward, which is the reason for this article. I will show you how to install Spark in standalone mode on Ubuntu 16.04 LTS to prepare your Spark development environment so that you can begin playing with it.

 

Spark Cluster Manager Types

Spark can be deployed (as of this writing) in four different ways. The first, and the one we will use, is standalone mode. This is the easiest way to get started, as it relies on the simple cluster manager that comes prepackaged with Spark.

Other methods to deploy a Spark Cluster include:

  • Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
  • Hadoop YARN – the resource manager in Hadoop 2.
  • Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.

 

Pre-Requisites

To get started you need a clean install of Ubuntu 16.04 LTS and a copy of Spark. We will be using version 2.3, prebuilt for Hadoop 2.7 and later, which you can download here: https://spark.apache.org/downloads.html

Go ahead and download the spark-2.3.0-bin-hadoop2.7.tgz file.

Apache Spark 2.3

Next, download Scala over at https://www.scala-lang.org/download/

You want to download the Scala binaries for Unix. I am using Scala 2.12.6, which downloads the file scala-2.12.6.tgz.

Spark Installation Steps

Open up your Ubuntu terminal and follow the steps below. Spark 2.3 requires Java 8, so that is where we will begin. We will also install an SSH server to allow us to SSH into our Ubuntu machine.

Step 1: Install Java 8

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install openjdk-8-jdk
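
To confirm the JDK installed correctly, check the reported version; it should say 1.8.x. If you have more than one JDK installed, update-alternatives lets you select OpenJDK 8 explicitly.

java -version
javac -version
sudo update-alternatives --config java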

Step 2: Install SSH Server

sudo apt-get install openssh-server
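
To verify the SSH server came up, check the service status and try a loopback connection (Ubuntu 16.04 uses systemd, so systemctl applies here):

sudo systemctl status ssh
ssh localhost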

If you are virtualizing Ubuntu with VirtualBox, use the following settings to be able to reach your server.

 VirtualBox Apache Spark

Next up, we create a group and a user under which Spark will run.

Step 3: Create User to Run Spark and Set Permissions

sudo addgroup sparkgroup
sudo adduser --ingroup sparkgroup sparkuser

Configure Permissions for User

sudo gedit /etc/sudoers

Add the below line to the file:

sparkuser ALL=(ALL) ALL
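
A note of caution: a typo in /etc/sudoers can lock you out of sudo entirely. A safer alternative is to open the file with visudo, which checks the syntax before saving, or to place the rule in a drop-in file under /etc/sudoers.d, as sketched below.

sudo visudo

Or, as a drop-in file:

echo 'sparkuser ALL=(ALL) ALL' | sudo tee /etc/sudoers.d/sparkuser
sudo chmod 0440 /etc/sudoers.d/sparkuser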

Step 4: Create Spark Directory and Set Permissions

sudo mkdir /usr/local/spark

Change the folder Ownership to sparkuser

sudo chown -R sparkuser /usr/local/spark
sudo chmod -R 755 /usr/local/spark

Step 5: Create Scala Directory and Set Permissions

sudo mkdir /usr/local/scala

Change Ownership to sparkuser

sudo chown -R sparkuser /usr/local/scala
sudo chmod -R 755 /usr/local/scala

Step 6: Create Spark Temporary Directory

sudo mkdir /app
sudo mkdir /app/spark
sudo mkdir /app/spark/tmp

Change Ownership to sparkuser

sudo chown -R sparkuser /app/spark/tmp
sudo chmod -R 755 /app/spark/tmp
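
Before switching users, confirm that all three locations exist and are owned by sparkuser:

ls -ld /usr/local/spark /usr/local/scala /app/spark/tmp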

Step 7: Switch to sparkuser

su sparkuser

Step 8: Extract and Move Files

Switch to the directory where you saved the files downloaded in the pre-requisites section. In my case, they are in the Downloads folder.

cd Downloads

Extract

sudo tar xzf spark-2.3.0-bin-hadoop2.7.tgz
sudo tar xzf scala-2.12.6.tgz

Move

sudo mv spark-2.3.0-bin-hadoop2.7/* /usr/local/spark
sudo mv scala-2.12.6/* /usr/local/scala
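
Because the archives were extracted with sudo, the files that just landed in /usr/local/spark and /usr/local/scala may still be owned by root rather than sparkuser. If ls -l shows root as the owner, re-apply the permissions from Steps 4 and 5:

sudo chown -R sparkuser /usr/local/spark /usr/local/scala
sudo chmod -R 755 /usr/local/spark /usr/local/scala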

Step 9: Set Environment Properties

You are now editing sparkuser's own .bashrc, so sudo is not needed:

gedit $HOME/.bashrc

Add the properties below. JAVA_HOME is included so that the $JAVA_HOME reference in PATH resolves; the path matches the OpenJDK 8 package installed in Step 1.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SCALA_HOME=/usr/local/scala
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH

After Saving the File, reload environment

source $HOME/.bashrc
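
As a quick sanity check that the variables took effect and the Spark and Scala binaries are on your PATH:

echo $SPARK_HOME
echo $SCALA_HOME
which spark-shell
scala -version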

Step 10: Set Spark Environment Properties

cd /usr/local/spark/conf

Copy template

sudo cp spark-env.sh.template spark-env.sh

Open File

sudo gedit spark-env.sh

Add Properties to File

export SCALA_HOME=/usr/local/scala
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=2
export SPARK_MASTER_IP=127.0.0.1
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_DIR=/app/spark/tmp
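
Note that as of Spark 2.x, SPARK_MASTER_IP is deprecated in favor of SPARK_MASTER_HOST. The old name still works but prints a deprecation warning when the master starts, so you may prefer to use the newer variable instead:

export SPARK_MASTER_HOST=127.0.0.1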

Step 11: Set Spark Defaults

Copy Template

sudo cp spark-defaults.conf.template spark-defaults.conf

Edit File

sudo gedit spark-defaults.conf

Add the property below to the file. Replace IP_ADDRESS with your own. You can also use 127.0.0.1 (localhost).

spark.master                     spark://IP_ADDRESS:7077

Step 12: Set Slaves (localhost in our case)

Copy Template

sudo cp slaves.template slaves

Edit File

sudo gedit slaves

Add Properties to File

localhost

Step 13: Run the Spark Start Script

Change directory

cd /usr/local/spark/sbin

Run Start Script

./start-all.sh

 Apache Spark Started
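
You can also confirm the processes from the terminal with jps, which ships with the OpenJDK package installed in Step 1. With SPARK_WORKER_INSTANCES set to 2, you should see one Master and two Worker entries in its output.

jps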

Step 14: Browse the Spark Web UI

You will now be able to open up your browser and view the web UI at http://YOUR_SPARK_IP:8080/, which will show two workers.

Apache Spark Web UI

Notice that you also have the address of the Spark master, on port 7077. At this point we have a standalone, single-node Apache Spark cluster.
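
To confirm that applications can actually connect to the cluster, start a Spark shell against the master URL shown in the web UI (adjust the address if you did not use 127.0.0.1). While the shell is running, it will appear under Running Applications in the web UI.

spark-shell --master spark://127.0.0.1:7077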

Step 15: To Stop Spark

./stop-all.sh

Conclusion

Congratulations! You now have a standalone, single-node Spark cluster that you can use to test and learn to write Spark programs. You can connect to this cluster to submit applications or even debug them from your favorite IDE.
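
As a quick end-to-end test, you can submit one of the example applications that ship with Spark. The jar name below assumes the prebuilt Spark 2.3.0 package from the pre-requisites, which is compiled against Scala 2.11; adjust the file name if your download differs.

spark-submit --master spark://127.0.0.1:7077 --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.0.jar 100

The job should finish in a few seconds and print an approximation of Pi to the console.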

 

MJ

Advanced analytics professional currently practicing in the healthcare sector. Passionate about Machine Learning, Operations Research and Programming. Enjoys the outdoors and extreme sports.
