Posts

Showing posts from September, 2018

Setting up PySpark 2.3.1 on Ubuntu 18.04

To set up PySpark 2.3.1 on Ubuntu, you need Java 1.8+, Scala, Python 3.5.x, and the py4j package installed. As an IDE, we will be using jupyter-notebook here. Setting up PySpark on Ubuntu is very simple if you follow the steps below in order.

Open your terminal on Ubuntu.

Ubuntu 18.04 comes with Python 3 preinstalled (3.6.x on a stock install, which also works with PySpark 2.3.1). If it is missing, install it by following the instructions at the link below (it is a very good discussion forum for installation questions): https://askubuntu.com/questions/865554/how-do-i-install-python-3-6-using-apt-get

To refresh the package index so that you install the latest available versions, run the command below:

    sudo apt-get update

Once the update is done, install Java 1.8 using the command below:

    sudo apt install openjdk-8-jre-headless

Once Java is installed, validate it using java -version (Java 8 does not accept the double-dash --version form): it should display the version as 1.8.x.

Once validated, install Scala as below:

    sudo apt-get install scala

Once installation is done, validate it using scala -version: it should display the late
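The prerequisites listed above can be checked from Python before going further. The following is only a minimal sketch, not part of the original post: the helper names (find_tool, has_module, prerequisite_report) are hypothetical, and it assumes the tools are exposed on PATH under the executable names java, scala and jupyter.

```python
import importlib.util
import shutil
import sys


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is absent."""
    return shutil.which(name)


def has_module(name):
    """Return True if a top-level Python module can be imported."""
    return importlib.util.find_spec(name) is not None


def prerequisite_report():
    """Collect the status of each prerequisite named in the post."""
    return {
        "python": sys.version.split()[0],   # version of the running interpreter
        "java": find_tool("java"),          # assumed executable name
        "scala": find_tool("scala"),        # assumed executable name
        "py4j": has_module("py4j"),
        "jupyter": find_tool("jupyter"),    # assumed executable name
    }


# Print one line per prerequisite; anything not found is flagged as MISSING.
for name, status in prerequisite_report().items():
    print(f"{name:8} {status if status else 'MISSING'}")
```

This only confirms that each tool is present; the version checks (java -version, scala -version) still need to be run in the terminal.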

PySpark: Read a File to an RDD and Convert It to a Data Frame

Through this blog, I am trying to explain different ways of creating RDDs by reading files and then creating Data Frames out of those RDDs.

This blog is for: PySpark (Spark with Python) analysts and all those who are interested in learning PySpark.

Prerequisites: You should have a good knowledge of Python as well as a basic knowledge of PySpark.

RDD (Resilient Distributed Dataset): It is an immutable distributed collection of objects. This is the fundamental data structure of Spark. By default, when you read from a file using sparkContext, it is converted into an RDD with each line as an element of type string. But this lacks an organised structure.

Data Frames: These were created as a higher-level abstraction that imposes a structure on the above distributed collection. A Data Frame has rows and columns (almost similar to pandas). From Spark 2.3.x, Data Frames and Datasets are more popular and have been used more than RDDs. Learn in more detail here :  ht
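To make the two ideas concrete, here is a minimal sketch of reading a text file into an RDD of strings and then imposing a structure on it to get a Data Frame. It is an illustration under assumptions, not the post's own code: the file name people.txt, its "name,age" line layout, the column names, and the helpers parse_line and main are all hypothetical.

```python
def parse_line(line, sep=","):
    """Turn one raw text line such as 'Alice,30' into a (name, age) tuple."""
    name, age = line.split(sep)
    return name.strip(), int(age)


def main(path="people.txt"):
    """Read a text file into an RDD, then convert it to a Data Frame."""
    # Imported here so parse_line can be tried out without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()

    # textFile returns an RDD in which every element is one line, as a string.
    lines = spark.sparkContext.textFile(path)

    # Still an RDD, but each element is now a tuple instead of a raw string.
    rows = lines.map(parse_line)

    # Naming the columns imposes the row/column structure described above.
    df = spark.createDataFrame(rows, ["name", "age"])
    df.printSchema()
    df.show()

    spark.stop()
```

Calling main("people.txt") from a notebook cell or a script runs the whole flow; parse_line is kept separate so the per-line logic can be checked on its own.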