Posts

Showing posts from November, 2018

pyspark : groupByKey vs reduceByKey vs aggregateByKey - why reduceByKey and aggregateByKey are preferred in spark2

Through this article I am trying to simplify the concepts of three similar wide transformations: groupByKey(), reduceByKey() and aggregateByKey(). This blog is for: pyspark (Spark with Python) analysts and all those who are interested in learning pyspark. Prerequisites: a good knowledge of Python and a basic knowledge of pyspark functions. Data sets used: for demonstration purposes, I am using the below data sets (files in HDFS): "orders" with columns order_id, order_tmst, order_customer_id, order_status, and "orderitems" with columns order_item, order_item_order_id, product_id, order_qty, order_item_subtotal, price_per_qty. Before getting into the functionality of these transformations, we need to understand the shuffle phase in Spark. As we know, an RDD is stored in memory as multiple partitions. Transformations such as map(), flatMap(), filter(), etc. can be applied to each of these partitions of data independently.
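As a rough sketch of the comparison the post goes on to make, the snippet below computes revenue per order from the "orderitems" data described above in three ways. The HDFS path is a hypothetical placeholder, and the field positions assume the column order listed above.

# A minimal pyspark sketch, assuming a hypothetical HDFS path and that
# order_item_order_id and order_item_subtotal are the 2nd and 5th fields.
from pyspark import SparkContext

sc = SparkContext(appName="byKeyComparison")

order_items = sc.textFile("/user/demo/orderitems")   # hypothetical path

# Build (order_id, subtotal) pairs.
pairs = order_items.map(lambda line: line.split(",")) \
                   .map(lambda f: (int(f[1]), float(f[4])))

# groupByKey(): every (key, value) pair is shuffled across the network,
# and the sum only happens after the shuffle.
revenue_gbk = pairs.groupByKey().mapValues(lambda vals: sum(vals))

# reduceByKey(): values are combined within each partition first (map-side
# combine), so far less data crosses the network during the shuffle.
revenue_rbk = pairs.reduceByKey(lambda a, b: a + b)

# aggregateByKey(): same map-side combine, but with an explicit zero value and
# separate within-partition / across-partition functions. Here it also counts items.
revenue_and_count = pairs.aggregateByKey(
    (0.0, 0),                                   # zero value: (revenue, item count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),    # merge a value within a partition
    lambda a, b: (a[0] + b[0], a[1] + b[1]))    # merge accumulators across partitions

print(revenue_rbk.take(5))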

pyspark : Dealing with partitions

This is a small article in which I am trying to explain how partitions are created, how to find out how many there are, how to control their number, etc. This blog is for: pyspark (Spark with Python) analysts and all those who are interested in learning pyspark. Prerequisites: a good knowledge of Python and a basic knowledge of pyspark functions. Data set used: for demonstration purposes, I am using the below data set (file in HDFS): "orderitems" with columns order_item, order_item_order_id, product_id, order_qty, order_item_subtotal, price_per_qty. If you are reading a file from HDFS into an RDD, by default Spark will create a number of partitions equal to the number of blocks it has to read. Since the default block size of HDFS (2.x) is 128 MB, a file smaller than this size will be read into an RDD as 1 partition, because only one block is used. A 256 MB file gives 2 partitions, and so on. To know the current number of partitions of an RDD, use getNumPartitions().
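A short sketch of the calls involved, again assuming a hypothetical HDFS path for the "orderitems" file:

from pyspark import SparkContext

sc = SparkContext(appName="partitionsDemo")

order_items = sc.textFile("/user/demo/orderitems")    # one partition per HDFS block by default
print(order_items.getNumPartitions())                 # current number of partitions

# Ask for a minimum number of partitions while reading (Spark may create more, not fewer).
order_items_8 = sc.textFile("/user/demo/orderitems", minPartitions=8)

# Increase partitions after the fact (full shuffle) ...
more_parts = order_items.repartition(16)

# ... or reduce them without a full shuffle.
fewer_parts = more_parts.coalesce(4)
print(fewer_parts.getNumPartitions())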

Spark (With Python) : map() vs mapPartitions()

Many developers who are beginners in Spark are confused about the map() and mapPartitions() functions. This article is an attempt to resolve that confusion. This blog is for: pyspark (Spark with Python) analysts and all those who are interested in learning pyspark. Prerequisites: a good knowledge of Python and a basic knowledge of pyspark functions. If you use map() over an RDD, the function called inside it will run for every record. That means if you have 10M records, the function will be executed 10M times. This is expensive, especially in scenarios involving database connections and querying data from a database. Let's say that inside the map function we have a function defined that connects to a database and queries it, and our RDD has 10M records. Now this function will execute 10M times, which means 10M database connections will be created. This is very expensive. In these kinds of scenarios, mapPartitions() is the better choice: the function runs once per partition instead of once per record.
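A minimal sketch of the difference, using a hypothetical connect_to_db() helper and a toy price lookup in place of a real database; the point is only where the connection gets created.

from pyspark import SparkContext

sc = SparkContext(appName="mapVsMapPartitions")

product_ids = sc.parallelize(range(1, 1001), 4)   # toy RDD with 4 partitions

def connect_to_db():
    # Placeholder for a real database connection (e.g. via a DB-API driver).
    class FakeConn(object):
        def query(self, product_id):
            return product_id * 1.5               # pretend this is a price lookup
        def close(self):
            pass
    return FakeConn()

# map(): the function body runs once per record, so one connection per record.
def price_per_record(product_id):
    conn = connect_to_db()                        # 1,000 connections for 1,000 records
    price = conn.query(product_id)
    conn.close()
    return (product_id, price)

prices_map = product_ids.map(price_per_record)

# mapPartitions(): the function body runs once per partition, so only 4 connections here.
def price_per_partition(records):
    conn = connect_to_db()                        # one connection for the whole partition
    for product_id in records:
        yield (product_id, conn.query(product_id))
    conn.close()

prices_mp = product_ids.mapPartitions(price_per_partition)

print(prices_mp.take(5))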