Setting up PySpark 2.3.1 on Ubuntu 18.04

To set up PySpark 2.3.1 on Ubuntu, you need Java 1.8+, Scala, Python 3.5+, and the py4j package installed. As the IDE, we will be using Jupyter Notebook here.
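Before starting, you can optionally check which of these prerequisites are already present. The snippet below is a minimal sketch (it assumes only python3 and the standard library, and uses the usual version flags for each tool); the tools it looks for are exactly the ones installed in the steps that follow:

    # Optional pre-check sketch: run with python3 (standard library only).
    # The tools checked are the ones installed in the steps below.
    import shutil
    import subprocess
    import sys

    print("Python version:", sys.version.split()[0])   # should be 3.5 or later

    for tool, flag in [("java", "-version"), ("scala", "-version"),
                       ("pip3", "--version"), ("jupyter", "--version")]:
        if shutil.which(tool) is None:
            print(tool + ": not found (it will be installed in the steps below)")
            continue
        # 'java -version' writes to stderr, so capture both output streams together.
        result = subprocess.run([tool, flag], stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        lines = result.stdout.decode(errors="replace").splitlines()
        print(tool + ": " + (lines[0] if lines else "installed"))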

Setting up PySpark on Ubuntu is very simple if you follow the steps below in order:



  1. Open your terminal on Ubuntu
  2. Ubuntu 18.04 ships with Python 3.6 by default; otherwise, install it as per the instructions in the link below (a very good discussion thread on the installation):

    https://askubuntu.com/questions/865554/how-do-i-install-python-3-6-using-apt-get
  3. To bring your OS's package index up to date, run the command below:
    sudo apt-get update
  4. Once the update is done, install Java 1.8 using the command below:
    sudo apt install openjdk-8-jre-headless
  5. Once Java is installed, validate it using java -version:
    it should report the Java version as 1.8.x
  6. Once validated, install Scala as below:
    sudo apt-get install scala
  7. Once the installation is done, validate it using scala -version:
    it should display the installed Scala version
  8. Install the py4j package using pip3 with the command below (if pip3 itself is missing, install it first with sudo apt install python3-pip):
    pip3 install py4j
  9. Once it is done, validate it just by running pip3:
    it should display all available help options for pip3
  10. Now install the IDE, Jupyter, by entering:
    sudo apt install jupyter
  11. Once the installation completes, validate it by entering jupyter-notebook:
    it should open a Jupyter Notebook tab in Firefox, with the server session running in the terminal
  12. Now stop the server in the terminal with Ctrl+C and close the Firefox window as well
  13. Now download the tar file for Spark 2.3.1 (the stable version of Spark as of this writing) from the link below:
    https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
  14. Copy the downloaded tar file to your home directory
  15. Extract it using the command below:
    sudo tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
  16. The next step is to define and export the environment variables permanently for your user:
    To define and export environment variables permanently on Ubuntu, add the lines below to your ~/.bashrc file and save it.
    First replace 'user_name' with your own user name, then copy all of the lines below:

    export SPARK_HOME='/home/user_name/spark-2.3.1-bin-hadoop2.7'
    export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
    export PYSPARK_DRIVER_PYTHON='jupyter'
    export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
    export PYSPARK_PYTHON=python3
    export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64"
    export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
  17. Now open ~/.bashrc using nano ~/.bashrc, paste the copied content at the very bottom of the file, then save and exit with Ctrl+X.
  18. Now restart your terminal (or run source ~/.bashrc) so the new variables take effect
  19. Enter python3 to open a Python 3 interpreter
  20. Now type the command below and press Enter:
    import pyspark

    The cursor should move to the next prompt without any error. This means you have successfully completed the PySpark 2.3.1 setup on Ubuntu. If it throws an error, something has gone wrong somewhere in between; as an additional check, you can run the small smoke test shown below.
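The following is a minimal, optional smoke test (a sketch; the application name and the numbers used are arbitrary) that starts a local Spark session from the same python3 interpreter and runs a trivial job, confirming that SPARK_HOME, PYTHONPATH, and py4j are all wired up correctly:

    # Optional smoke test: run inside the python3 interpreter (or a Jupyter cell).
    # Assumes the environment variables from step 16 are already in effect.
    import os
    from pyspark.sql import SparkSession

    print("SPARK_HOME:", os.environ.get("SPARK_HOME"))   # should point to the extracted folder

    # Start a local Spark session; "local[*]" uses all available CPU cores.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("pyspark-setup-check")   # arbitrary application name
             .getOrCreate())

    print("Spark version:", spark.version)               # should print 2.3.1

    # Run a trivial job to confirm the driver and executors can exchange data.
    rdd = spark.sparkContext.parallelize(range(10))
    print("Sum of 0..9:", rdd.sum())                      # should print 45

    spark.stop()

If the session starts and both printed values match, the setup is complete; if it fails, recheck the environment variables in ~/.bashrc first.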
