Different Ways of Transposing a DataFrame in PySpark

When I started writing code to transpose DataFrames, I found the different methods below. I am sharing all of them here.

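Creating a test input DataFrame to be transposed

ds = {'one': [0.3, 1.2, 1.3, 1.5, 1.4, 1.0], 'two': [0.6, 1.2, 1.7, 1.5, 1.4, 2.0]}

# sc is the SparkContext (for example spark.sparkContext); each dictionary key becomes the first field of its row
df = sc.parallelize([(k,) + tuple(v) for k, v in ds.items()]).toDF()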

df.show()
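To see what the list comprehension hands to sc.parallelize, the row-building step can be run in plain Python without a Spark session. This is only a sketch of that one step; the name rows is introduced here for illustration.

```python
# Same dictionary as in the post.
ds = {'one': [0.3, 1.2, 1.3, 1.5, 1.4, 1.0],
      'two': [0.6, 1.2, 1.7, 1.5, 1.4, 2.0]}

# Each dictionary key becomes the first field of a row, followed by its values.
rows = [(k,) + tuple(v) for k, v in ds.items()]

for row in rows:
    print(row)
```

Because no column names are passed to toDF(), the resulting DataFrame columns are named _1, _2, and so on, with _1 holding the strings 'one' and 'two'.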

Method 1 (this method converts the Spark object into a Python object: the RDD is collected into a list of tuples holding the entire dataset, so all of the data must fit in driver memory):

inp_list = df.rdd.map(tuple).collect()  # Create a list of tuples, one per row

# Unpack the list and zip all the tuples together

# transposed[0] holds the header and transposed[1:] holds the transposed rows

transposed = list(zip(*inp_list))

df_transpose = spark.createDataFrame(transposed[1:], list(transposed[0]))

df_transpose.show()
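The core of method 1 is the zip(*...) idiom, which can be tried on its own. The sketch below uses a shortened, hard-coded stand-in for the collected rows (an assumption for illustration, not the output of a real Spark job).

```python
# Stand-in for df.rdd.map(tuple).collect(): one tuple per input row.
inp_list = [('one', 0.3, 1.2, 1.3),
            ('two', 0.6, 1.2, 1.7)]

# zip(*inp_list) pairs up the i-th field of every row, which is a transpose.
transposed = list(zip(*inp_list))

header = list(transposed[0])  # first transposed tuple: the new column names
data = transposed[1:]         # remaining tuples: the transposed rows

print(header)
print(data)
```

Here header and data play the roles of the two arguments passed to spark.createDataFrame in method 1.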

Method 2 (in this method only the header data is converted into a Python list; the rest of the work is carried out as Spark transformations):

groupedRDD = df.rdd.map(tuple).flatMap(lambda x: (x[0], x[1:]))  # Separate the header value and the row data of each row

# Collect the header names into a list

# The elements of groupedRDD of type str are the header names, i.e. type(x) == str

header = groupedRDD.filter(lambda x: type(x) == str).collect()

# The elements of groupedRDD of type tuple are the row data, i.e. type(x) == tuple

# All current rows are grouped under a single key and the tuples are zipped together to form the transposed tuples; this is the 'groupByKey().mapValues(lambda x: tuple(zip(*tuple(x))))' part

# flatMap(lambda x: x[1]) drops the key and keeps only the tuples of transposed rows

df_T = groupedRDD.filter(lambda x: type(x) == tuple).map(lambda x: (1, x)).groupByKey().mapValues(lambda x: tuple(zip(*tuple(x)))).flatMap(lambda x: x[1]).toDF(header)
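The chain in method 2 is dense, so here is a plain-Python imitation of each step on a shortened, hard-coded input. It is a sketch of the logic only: itertools.chain stands in for flatMap, list comprehensions stand in for filter, and the names rows and value_rows are introduced for illustration.

```python
from itertools import chain

# Stand-in for the rows of df, shortened for readability.
rows = [('one', 0.3, 1.2, 1.3),
        ('two', 0.6, 1.2, 1.7)]

# flatMap(lambda x: (x[0], x[1:])) flattens each (name, values) pair
# into two separate elements of the same collection.
grouped = list(chain.from_iterable((r[0], r[1:]) for r in rows))

# The str elements are the header names, the tuple elements are the row data.
header = [x for x in grouped if isinstance(x, str)]
value_rows = [x for x in grouped if isinstance(x, tuple)]

# Grouping everything under one key and zipping amounts to zipping all value tuples.
transposed = list(zip(*value_rows))

print(header)
print(transposed)
```

One caveat worth keeping in mind: in plain Python the order of value_rows matches the order of header, but as far as I know groupByKey does not promise to preserve the order of values within a key, so on a real cluster it is worth checking that the columns of df_T line up with header. Putting every row under the single key 1 also means one task handles the whole dataset.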

Method 3 (here we take a pure RDD approach and do not collect any data into Python objects on the driver):

# Each value is keyed by its position, the (header, value) pairs sharing a position are grouped together and converted into a dictionary. The resulting RDD of dictionaries is then converted into a DataFrame

df_T = df.rdd.map(tuple).flatMap(lambda x: [(y, (x[0], z)) for y, z in enumerate(x[1:])]).groupByKey().mapValues(tuple).map(lambda x: dict(x[1])).toDF()
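To make the keying in method 3 concrete, the same steps can be imitated in plain Python. This is a sketch under assumptions: a defaultdict stands in for groupByKey, the input is a shortened hard-coded list, and the sorted() call is my addition to get a stable order (Spark does not sort the groups).

```python
from collections import defaultdict

# Stand-in for the rows of df, shortened for readability.
rows = [('one', 0.3, 1.2, 1.3),
        ('two', 0.6, 1.2, 1.7)]

# flatMap step: key every value by its position, keeping the column name with it.
pairs = [(i, (r[0], v)) for r in rows for i, v in enumerate(r[1:])]

# groupByKey step: collect all (name, value) pairs that share a position.
groups = defaultdict(list)
for position, name_value in pairs:
    groups[position].append(name_value)

# map(lambda x: dict(x[1])) step: one dictionary per transposed row.
records = [dict(groups[position]) for position in sorted(groups)]

for record in records:
    print(record)
```

Since each dictionary carries its own column names, the column order no longer depends on the grouping order, but the row order of df_T on a real cluster is not guaranteed to follow the original positions unless the key is kept and sorted on.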

Method 4 (this is the same as method 3, except that instead of the deprecated dictionary-to-DataFrame conversion it uses Row-object-to-DataFrame conversion):

# Each value is keyed by its position, the (header, value) pairs sharing a position are grouped together and converted into a Row object. The resulting RDD of Rows is then converted into a DataFrame

from pyspark.sql import Row

df_T = df.rdd.map(tuple).flatMap(lambda x: [(y, (x[0], z)) for y, z in enumerate(x[1:])]).groupByKey().mapValues(tuple).map(lambda x: Row(**dict(x[1]))).toDF()

