To set up PySpark 2.3.1 on Ubuntu, you need Java 1.8+, Scala, Python 3.5.x, and the py4j package installed. As the IDE, we will be using Jupyter Notebook here.
Setting up PySpark on Ubuntu is very simple if you follow the steps below in order.
- Open a terminal on your Ubuntu machine.
- Ubuntu 18.04 ships with Python 3.5.x by default; otherwise, install it following the instructions in the link below (a very good discussion thread on installation):
https://askubuntu.com/questions/865554/how-do-i-install-python-3-6-using-apt-get
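To double-check which Python version you have before going further, you can run this quick snippet in a python3 shell (a plain sanity check, nothing Spark-specific):
import sys
# This guide assumes Python 3.5.x; an older interpreter will cause problems later
print(sys.version)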
- To bring your OS package index up to date, run the command below:
sudo apt-get update
- Once the update is done, install Java 1.8 using the command below:
sudo apt install openjdk-8-jre-headless
- Once Java is installed, validate it with java -version (note the single dash; Java 8 does not support --version):
It should display a JDK version of 1.8.x.
- Once validated, install Scala as below:
sudo apt-get install scala
- Once the installation is done, validate it using scala -version:
It should display the installed Scala version.
- Next, install the py4j package using pip3 (if pip3 itself is missing, install it first with sudo apt install python3-pip):
pip3 install py4j
- Once it is done, validate pip3 just by running it with no arguments:
It should display all available help options for pip3.
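As an extra, optional check that py4j really landed in your Python 3 environment, you can try importing it from a python3 shell; a clean import means the package is in place:
# If this raises an ImportError, py4j was installed
# for a different Python than the one you are running.
import py4j
print(py4j.__file__)  # shows where the package lives on disk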
- Now install the IDE, Jupyter, by entering:
sudo apt install jupyter
- Once the installation completes, validate it by entering jupyter-notebook:
It should open Jupyter Notebook in a Firefox window, with the server session running in the terminal.
- Now stop the server with Ctrl+C in the terminal and close the Firefox window as well.
- Now download the tar file for Spark 2.3.1 (the stable version of Spark as of this writing) from the link below:
https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
(Older Spark releases are eventually moved to archive.apache.org, so if the mirror link stops working, look for the same file there.)
- Copy the downloaded tar file to your home directory.
- Extract it using the command below:
sudo tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
- The next step is to define and export the environment variables permanently for your user:
To define and export environment variables permanently on Ubuntu, add the lines below to your ~/.bashrc file and save it.
First replace 'user_name' with your own username, then copy all of the lines below:
export SPARK_HOME='/home/user_name/spark-2.3.1-bin-hadoop2.7'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64"
export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
- Now open .bashrc with nano ~/.bashrc, paste the copied content at the very bottom of the file, and exit with Ctrl+X (saving when prompted). Note that with PYSPARK_DRIVER_PYTHON set to 'jupyter' and PYSPARK_DRIVER_PYTHON_OPTS set to 'notebook', the pyspark launcher will open a Jupyter notebook instead of the plain Spark shell.
- Now restart your terminal
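Before opening Python, you can optionally confirm that the new variables survived the restart. A minimal check, run inside a python3 shell (assuming .bashrc was edited as above):
import os
# Each line should print the path you set in ~/.bashrc;
# 'None' means that variable was not exported correctly.
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("JAVA_HOME"))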
- Enter python3 to open a Python 3 shell.
- Now type the command below and press Enter:
import pyspark
The cursor should move to the next prompt without any error. This means you have successfully completed the PySpark 2.3.1 setup on Ubuntu. If it throws an error, something went wrong in one of the steps above.
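If the import works, you can go one step further with a small end-to-end smoke test. The sketch below (the app name and sample data are arbitrary choices, not part of the setup itself) starts a local SparkSession, builds a tiny DataFrame, and runs an action on it:
from pyspark.sql import SparkSession

# Start Spark in local mode, using all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("setup-smoke-test") \
    .getOrCreate()

# A two-row DataFrame, just to exercise the machinery end to end
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()          # prints the rows as a small table
print(df.count())  # should print 2

spark.stop()
If df.show() prints the two rows without errors, Spark, Java, and Python are all talking to each other correctly.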