pyspark : groupByKey vs reduceByKey vs aggregateByKey - why reduceByKey and aggregateByKey are preferred in Spark 2
Through this article I am trying to simplify the concepts behind three similar wide transformations: groupByKey(), reduceByKey() and aggregateByKey().

This blog is for: pyspark (Spark with Python) analysts and anyone interested in learning pyspark.

Prerequisites: a good knowledge of Python and a basic knowledge of pyspark functions.

Data sets used: for demonstration purposes, I am using the data sets below (files in HDFS; see the loading sketch that follows):

"orders" with columns order_id, order_tmst, order_customer_id, order_status
"orderitems" with columns order_item, order_item_order_id, product_id, order_qty, order_item_subtotal, price_per_qty

Before getting into the functionality of these transformations, we need to understand the shuffling phase in Spark. As we know, an RDD is stored in memory as multiple partitions. Transformations such as map(), flatMap(), filter(), etc. are narrow: they can be applied to each of these partitions of data independently, with no data moving between partitions.
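To make the examples that follow concrete, here is a minimal sketch of loading the two files as RDDs. The HDFS paths and the application name are assumptions for illustration, not part of the original setup; substitute your own locations.

from pyspark import SparkContext

sc = SparkContext(appName="byKeyDemo")  # hypothetical app name

# Each line of "orders" looks like:
# order_id,order_tmst,order_customer_id,order_status
orders = sc.textFile("/user/demo/orders")  # assumed HDFS path

# Each line of "orderitems" looks like:
# order_item,order_item_order_id,product_id,order_qty,order_item_subtotal,price_per_qty
order_items = sc.textFile("/user/demo/orderitems")  # assumed HDFS path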
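And here is a short sketch of such narrow transformations in action, assuming the RDDs loaded above and comma-delimited records. The "COMPLETE" status value is an assumption about the data; the point is that each step runs per partition with no shuffle.

# Narrow transformations: each step below runs on every partition
# independently, so no data moves across the network (no shuffle).

# Parse each orders line into a list of fields (map).
parsed_orders = orders.map(lambda line: line.split(","))

# Keep only completed orders (filter); order_status is the 4th field.
# "COMPLETE" is an assumed status value for illustration.
completed = parsed_orders.filter(lambda fields: fields[3] == "COMPLETE")

# Pair each order item with (order_id, subtotal) -- this is the
# (key, value) shape the *ByKey transformations discussed next operate on.
item_subtotals = order_items.map(
    lambda line: (int(line.split(",")[1]), float(line.split(",")[4]))
)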