pyspark : groupByKey vs reduceByKey vs aggregateByKey - why reduceByKey and aggregateByKey are preferred in Spark 2
Through this article I am trying to simplify the concepts of three similar wide transformations: groupByKey(), reduceByKey() and aggregateByKey().

This blog is for: pyspark (Spark with Python) analysts and anyone interested in learning pyspark.

Prerequisites: a good knowledge of Python and a basic knowledge of pyspark functions.

Data sets used: for demonstration purposes, I am using the below data sets (files in HDFS):

"orders" with columns order_id, order_tmst, order_customer_id, order_status

"orderitems" with columns order_item, order_item_order_id, product_id, order_qty, order_item_subtotal, price_per_qty

Before getting into the functionality of these transformations, we need to understand the shuffling phase in Spark. As we know, an RDD is stored in memory as multiple partitions. Narrow transformations such as map(), flatMap() and filter() can be applied to each of these partitions of data independently....
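To make the per-partition behaviour concrete, here is a minimal sketch of loading the two data sets as RDDs and applying narrow transformations to them. The HDFS paths, the app name, and the variable names are my own assumptions for illustration; substitute your actual file locations.

```python
# A minimal sketch, assuming the two files live at these (hypothetical) HDFS paths.
from pyspark import SparkContext

sc = SparkContext(appName="byKeyDemo")

# Each line of "orders" looks like:
# order_id,order_tmst,order_customer_id,order_status
orders = sc.textFile("/user/demo/orders")

# Each line of "orderitems" looks like:
# order_item,order_item_order_id,product_id,order_qty,order_item_subtotal,price_per_qty
order_items = sc.textFile("/user/demo/orderitems")

# Narrow transformations run on each partition independently - no shuffle:
closed_orders = orders.filter(lambda line: line.split(",")[3] == "CLOSED")

# Pair each order item with its order id and subtotal, ready for the
# by-key transformations discussed in this article.
subtotals = order_items.map(
    lambda line: (int(line.split(",")[1]), float(line.split(",")[4]))
)
```

Note that calling a wide transformation such as reduceByKey() on subtotals would trigger a shuffle, moving records with the same key onto the same partition; that shuffle phase is what distinguishes the three transformations compared in this article.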