Three tools for tackling Big Data are Apache Spark, an extremely fast general purpose cluster computing engine. Scala, a general-purpose programming language widely used in programming Spark. Apache Zeppelin, Web-based notebook that enables data-driven,
interactive data analytics and collaborative documents. If you know Jupyter, you will feel right at home with Zeppelin. In this article, you will learn how to setup your Zeppelin notebook environment to allow querying Oracle databases with Spark Scala.
This article is written using the Hortonworks Data Platform (HDP 2.6), one of the top three Hadoop distributions currently available. The other two are Cloudera and MapR.
Install Oracle JDBC Driver
In order to query Oracle from Spark, we need to first install an Oracle Database JDBC Driver. Go ahead and download ojdbc6.jar from the following link:
Next, we will import this jar into our hadoop cluster. Open up your browser and go to the Files View of Ambari. (You’ll notice my user is raj_ops, a default admin user of the HDP sandbox)
Upload the oracle jar to a HDFS folder of your choice. In this example, we are loading the jar to the /tmp HDFS location.
Zeppelin Shell Commands
Once the file is loaded into HDFS, we will do the following steps within Zeppelin. Aside from the following, you could also use SSH to copy files into your hadoop environment.
Open up a new Zeppelin notebook. Run the below shell command to transfer the file from HDFS to the local drive. We will be copying the ojdbc.jar file to the /user/lib folder.
%sh sudo hadoop fs -copyToLocal /tmp/ojdbc6.jar /usr/lib
To confirm that the file was copied, you can run below shell command in Zeppelin to list all files in the folder.
%sh ls /usr/lib
After running the above code, make sure you see the oracle jdbc file successfully copied to the /usr/lib folder. We are almost done.
Next, we will add the oracle jdbc file as a dependency of our Spark interpreter. In Zeppelin, an interpreter is a plugin that enables you to use a specific language/data-processing-backend. In our case, we will be using the %spark2 interpreter which allows us to program Spark using Scala.
Open the Interpreter configuration page and search for the spark2 intepreter. You will be seeing something like the below image.
Now, click on Edit and add the following artifact to the dependencies: /usr/lib/ojdbc6.jar
You should have something like the below image.
Finally, save the setting which will restart your interpreter and you will be now able to query oracle within Zeppelin.
Spark Scala Query Oracle within Zeppelin
Now that the Oracle JDBC is available and recognized by our Spark Scala interpreter, we can now begin to query oracle within Zeppelin.
First step is to to obtain sqlContext. This class is the entry point into the Spark SQL functionality. To initialize it, it needs SparkContext (our sc variable in the below code snippet).
%spark2 val sqlContext=new org.apache.spark.sql.SQLContext(sc)
Now, you can run a query to Oracle by running a code in the below format.
val query = "(select * from TableName) emp" val df = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:user/pwd@//host/sid" ,"dbtable" -> query) )
Notice, your query must be inside parenthesis and emp next to it. As far as the Oracle connection goes, replace usr, pwd, host and sid of the connection string with your own values.
Congratulations! You are now able to query oracle from Spark Scala within Zeppelin. You learned to transfer files from HDFS to local drive through Zeppelin shell commands (sh interpreter) and were able to setup Zeppelin in order to query Oracle from Spark Scala. If you are using python instead of scala, once performing the jar setup, it's just a matter of changing the code and interpreter to pyspark.