
Getting the Metadata Statistics of an HDFS file

The bash command below gives you metadata statistics of a raw file, such as the distinct number of columns and the distinct record lengths. This helps during the landing phase to determine whether any raw files are corrupted, especially when there are many raw files in your landing area and you need a quick sanity check on all of them. (The command is hard-coded for the .dat extension; change it according to your file type.)

  1. The output is saved to a metadata file named metadata_stat_<unixtimestamp>.txt, with one row per file in the format file_name|distinct_no_cols|distinct_no_lengths.
  2. If there is more than one value in distinct_no_cols for a file, either one or more records of that file are corrupted, or some quoted strings in your records contain the delimiter itself as part of a value (see the sketch after this list).
  3. If your file is a fixed-length file, the distinct_no_lengths column should contain only one value per file.
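
As a quick illustration of point 2, here is a minimal sketch with two made-up comma-delimited records, where the second record carries an extra delimiter; the distinct field counts immediately expose the inconsistency:

printf '1,John,NY\n2,Jane,CA,extra\n' | awk -F "," '{print NF}' | sort | uniq
# prints 3 and 4: two distinct column counts, so at least one record is suspect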

Command:

metafilename=metadata_stat_$(date +%s%3N).txt
echo file_name\|distinct_no_cols\|distinct_no_lengths > $metafilename
for filename in $(hadoop fs -ls /path/to/data/files/ | awk '{print $NF}' | grep '\.dat$'); do
  echo $filename \|$(hadoop fs -cat $filename | awk -F "," '{print NF}' | sort | uniq | tr "\n" "," | sed 's/\(.*\),/\1 /') \|$(hadoop fs -cat $filename | awk '{print length($0)}' | sort | uniq | tr "\n" "," | sed 's/\(.*\),/\1 /') >> $metafilename
done
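
To see what the per-file pipeline computes, here is a rough sketch of its two building blocks, assuming a comma-delimited file at the hypothetical path /path/to/data/files/sample.dat:

# Distinct column counts: the number of comma-separated fields on every record, reduced to the distinct set of counts seen in the file
hadoop fs -cat /path/to/data/files/sample.dat | awk -F "," '{print NF}' | sort | uniq

# Distinct record lengths: one line per distinct record length in the file
hadoop fs -cat /path/to/data/files/sample.dat | awk '{print length($0)}' | sort | uniq

The outer loop simply runs these two pipelines for every .dat file returned by hadoop fs -ls, joins the distinct values with commas (the tr and sed steps strip the trailing comma), and appends one pipe-separated row per file to the metadata file.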

Note: If you have delimiters inside a quoted string (as mentioned in point 2 above), you can use the command below instead. It counts columns with awk's FPAT variable (a gawk feature), so a quoted value that contains the delimiter is treated as a single column.


metafilename=metadata_stat_$(date +%s%3N).txt
echo file_name\|distinct_no_cols\|distinct_no_lengths > $metafilename
for filename in $(hadoop fs -ls /path/to/data/files/ | awk '{print $NF}' | grep '\.dat$'); do
  echo $filename \|$(hadoop fs -cat $filename | awk -v FPAT='([^,]*)|("[^"]+")' '{print NF}' | sort | uniq | tr "\n" "," | sed 's/\(.*\),/\1 /') \|$(hadoop fs -cat $filename | awk '{print length($0)}' | sort | uniq | tr "\n" "," | sed 's/\(.*\),/\1 /') >> $metafilename
done
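
As a minimal sketch of the difference (using a made-up record whose quoted value contains a comma), the plain -F "," split over-counts the columns while the FPAT pattern does not. FPAT is a gawk extension, so this assumes the awk on your machine is gawk:

echo '1,"Doe, John",NY' | awk -F "," '{print NF}'
# prints 4: the quoted value is split at its embedded comma

echo '1,"Doe, John",NY' | awk -v FPAT='([^,]*)|("[^"]+")' '{print NF}'
# prints 3: the quoted value is counted as a single column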



