
Setting up PySpark 2.3.1 on Ubuntu 18.04

To set up PySpark 2.3.1 on Ubuntu, you need Java 1.8+, Scala, Python 3.5+ and the py4j package installed. As the IDE, we will be using Jupyter Notebook here.
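
If you are not sure which of these prerequisites are already in place, the small Python sketch below prints the installed versions (or a note when a tool is missing). It is only a quick check, not part of the official setup; it assumes the command names java, scala and pip3 that are used in the steps that follow.

    # prereq_check.py - quick look at the tools used in this guide
    import shutil
    import subprocess
    import sys

    print("Python :", sys.version.split()[0])   # should be 3.5 or newer

    # Java 8 and Scala print their version banners on stderr, so merge the streams.
    for tool, flag in [("java", "-version"), ("scala", "-version"), ("pip3", "--version")]:
        if shutil.which(tool) is None:
            print("%-6s : not found" % tool)
            continue
        result = subprocess.run([tool, flag], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        first_line = (result.stdout.decode().splitlines() or ["(no output)"])[0]
        print("%-6s : %s" % (tool, first_line))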

Setting up PySpark on Ubuntu is very simple if you follow the steps below in order.



  1. Open a terminal on your Ubuntu OS
  2. Ubuntu 18.04 ships with Python 3.6.x by default, which is sufficient for PySpark; if Python 3 is missing, install it as per the instructions in the link below (a very good discussion thread on installation):

    https://askubuntu.com/questions/865554/how-do-i-install-python-3-6-using-apt-get
  3. Refresh the package lists so you install the latest versions:
    sudo apt-get update
  4. Once the update is done, install Java 1.8 using the command below:
    sudo apt install openjdk-8-jre-headless
  5. Once Java is installed, validate it using java -version (single dash; Java 8 does not accept --version):
    it should display the JDK version as 1.8.x
  6. Once validated, install Scala:
    sudo apt-get install scala
  7. Once that installation is done, validate it using scala -version:
    it should display the installed Scala version
  8. Install pip3 (if it is not already present) and use it to install the py4j package:
    sudo apt install python3-pip
    pip3 install py4j
  9. Once that is done, validate it by running pip3 with no arguments:
    it should display all the available help options for pip3; pip3 show py4j confirms the package itself
  10. Now install the IDE, Jupyter, by entering:
    sudo apt install jupyter
  11. Once the installation completes, validate it by entering jupyter-notebook:
    it should open Jupyter Notebook in a Firefox window, with the server session running in the terminal
  12. Now stop the server with Ctrl+C in the terminal and close the Firefox window as well
  13. Now download the tar file for Spark 2.3.1 (the stable version of Spark as of this writing) from the link below:
    https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
  14. Copy the downloaded tar file to your home directory
  15. Extract it using the command below:
    sudo tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
  16. The next step is to define and export the environment variables permanently for your user:
    To define and export env variables permanently in Ubuntu, add the lines below to your ~/.bashrc file and save it.
    First replace 'user_name' with your own user name, then copy the entire block below:

    export SPARK_HOME='/home/user_name/spark-2.3.1-bin-hadoop2.7'
    export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
    export PYSPARK_DRIVER_PYTHON='jupyter'
    export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
    export PYSPARK_PYTHON=python3
    export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64"
    export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
  17. Now open .bashrc using nano ~/.bashrc, paste the copied content at the very bottom of the file, then save and exit with Ctrl+X (confirm with Y).
  18. Restart your terminal (or run source ~/.bashrc) so the new variables take effect
  19. Enter python3 to open the Python 3 interpreter
  20. Now type the command below and press Enter (a fuller verification sketch follows after this list):
    import pyspark

    The cursor should move to the next prompt without any error. This means you have successfully completed the PySpark 2.3.1 setup on Ubuntu. If it throws an error, something has gone wrong in one of the steps above.
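
As a slightly fuller check than the bare import above, the sketch below can be saved to a file and run with python3. It assumes the ~/.bashrc entries from step 16 are already loaded into the current terminal; it first confirms that SPARK_HOME and JAVA_HOME are set and then runs a tiny job through a local SparkSession.

    # verify_pyspark.py - end-to-end smoke test for the setup above
    import os

    # The exports from step 16 should make these visible in any new terminal.
    for var in ("SPARK_HOME", "JAVA_HOME"):
        print(var, "=", os.environ.get(var, "NOT SET - recheck ~/.bashrc and restart the terminal"))

    # If this import fails, PYTHONPATH is probably not pointing at $SPARK_HOME/python.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")           # run everything inside this process
             .appName("setup-smoke-test")
             .getOrCreate())

    # A tiny job: distribute the numbers 1..10 and sum them on the local executors.
    total = spark.sparkContext.parallelize(range(1, 11)).sum()
    print("Sum of 1..10 computed by Spark:", total)   # expected: 55

    spark.stop()

If the two variables print as NOT SET, go back to steps 16-18; if the import itself fails, double-check the SPARK_HOME and PYTHONPATH lines in your ~/.bashrc.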
