Setting up PySpark 2.3.1 on Ubuntu 18.04

To set up PySpark 2.3.1 on Ubuntu, you need Java 1.8+, Scala, Python 3.5+, and the py4j package installed. As the IDE, we will be using Jupyter Notebook here.
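Before starting, you can optionally check which of these prerequisites are already present. The snippet below is a minimal sketch (it assumes only python3 and the standard library, and uses the usual version flags for each tool); the tools it looks for are exactly the ones installed in the steps that follow:

    # Optional pre-check sketch: run with python3 (standard library only).
    # The tools checked are the ones installed in the steps below.
    import shutil
    import subprocess
    import sys

    print("Python version:", sys.version.split()[0])   # should be 3.5 or later

    for tool, flag in [("java", "-version"), ("scala", "-version"),
                       ("pip3", "--version"), ("jupyter", "--version")]:
        if shutil.which(tool) is None:
            print(tool + ": not found (it will be installed in the steps below)")
            continue
        # 'java -version' writes to stderr, so capture both output streams together.
        result = subprocess.run([tool, flag], stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        lines = result.stdout.decode(errors="replace").splitlines()
        print(tool + ": " + (lines[0] if lines else "installed"))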

Setting up PySpark on Ubuntu is very simple if you follow the steps below in order:



  1. Open your terminal on Ubuntu
  2. Ubuntu 18.04 ships with Python 3.6 by default; otherwise, install it as per the instructions in the link below (a very good discussion thread on the installation):

    https://askubuntu.com/questions/865554/how-do-i-install-python-3-6-using-apt-get
  3. To bring your OS's package index up to date, run the command below:
    sudo apt-get update
  4. Once the update is done, install Java 1.8 using the command below:
    sudo apt install openjdk-8-jre-headless
  5. Once Java is installed, validate it using java -version:
    it should report the Java version as 1.8.x
  6. Once validated, install Scala as below:
    sudo apt-get install scala
  7. Once the installation is done, validate it using scala -version:
    it should display the installed Scala version
  8. Install the py4j package using pip3 with the command below (if pip3 itself is missing, install it first with sudo apt install python3-pip):
    pip3 install py4j
  9. Once it is done, validate it just by running pip3:
    it should display all available help options for pip3
  10. Now install the IDE, Jupyter, by entering:
    sudo apt install jupyter
  11. Once the installation completes, validate it by entering jupyter-notebook:
    it should open a Jupyter Notebook tab in Firefox, with the server session running in the terminal
  12. Now stop the server in the terminal with Ctrl+C and close the Firefox window as well
  13. Now download the tar file for Spark 2.3.1 (the stable version of Spark as of this writing) from the link below:
    https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
  14. Copy the downloaded tar file to your home directory
  15. Extract it using the command below:
    sudo tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
  16. The next step is to define and export the environment variables permanently for your user:
    To define and export environment variables permanently on Ubuntu, add the lines below to your ~/.bashrc file and save it.
    First replace 'user_name' with your own user name, then copy all of the lines below:

    export SPARK_HOME='/home/user_name/spark-2.3.1-bin-hadoop2.7'
    export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
    export PYSPARK_DRIVER_PYTHON='jupyter'
    export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
    export PYSPARK_PYTHON=python3
    export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64"
    export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
  17. Now open ~/.bashrc using nano ~/.bashrc, paste the copied content at the very bottom of the file, then save and exit with Ctrl+X.
  18. Now restart your terminal (or run source ~/.bashrc) so the new variables take effect
  19. Enter python3 to open a Python 3 interpreter
  20. Now type the command below and press Enter:
    import pyspark

    The cursor should move to the next prompt without any error. This means you have successfully completed the PySpark 2.3.1 setup on Ubuntu. If it throws an error, something has gone wrong somewhere in between; as an additional check, you can run the small smoke test shown below.
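The following is a minimal, optional smoke test (a sketch; the application name and the numbers used are arbitrary) that starts a local Spark session from the same python3 interpreter and runs a trivial job, confirming that SPARK_HOME, PYTHONPATH, and py4j are all wired up correctly:

    # Optional smoke test: run inside the python3 interpreter (or a Jupyter cell).
    # Assumes the environment variables from step 16 are already in effect.
    import os
    from pyspark.sql import SparkSession

    print("SPARK_HOME:", os.environ.get("SPARK_HOME"))   # should point to the extracted folder

    # Start a local Spark session; "local[*]" uses all available CPU cores.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("pyspark-setup-check")   # arbitrary application name
             .getOrCreate())

    print("Spark version:", spark.version)               # should print 2.3.1

    # Run a trivial job to confirm the driver and executors can exchange data.
    rdd = spark.sparkContext.parallelize(range(10))
    print("Sum of 0..9:", rdd.sum())                      # should print 45

    spark.stop()

If the session starts and both printed values match, the setup is complete; if it fails, recheck the environment variables in ~/.bashrc first.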
