Getting the Metadata Statistics of an HDFS File

The bash command below reports metadata statistics for raw files in HDFS: the distinct numbers of columns per record and the distinct record lengths found in each file. This helps in the landing phase to determine whether any raw files are corrupted, especially when there are many raw files in your landing area and you need to test them quickly. (The command is hard-coded for the .dat extension and a comma delimiter; change these to match your file type.)

1. The output of the command below is pipe-delimited, with the header file_name|distinct_no_cols|distinct_no_lengths followed by one line per file.

2. The output is saved to a metadata file named metadata_stat_<unix timestamp in milliseconds>.txt

3. If distinct_no_cols has more than one value for a file, one or more records in that file are corrupted, or some records contain quoted strings that include the delimiter as part of a value.

4. If your file is a fixed-length file, the distinct_no_lengths column should have only one value per file.
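To make the two columns concrete, here is a minimal local sketch that needs no Hadoop. The file name sample.dat and its three records are made up for illustration; the third record is deliberately missing a field.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: the third record is missing a field.
workdir=$(mktemp -d)
printf '%s\n' 'a,b,c' 'd,e,f' 'g,h' > "$workdir/sample.dat"

# Distinct numbers of comma-separated fields across all records.
cols=$(awk -F ',' '{print NF}' "$workdir/sample.dat" | sort -nu | paste -sd ',' -)

# Distinct record lengths (in characters) across all records.
lens=$(awk '{print length($0)}' "$workdir/sample.dat" | sort -nu | paste -sd ',' -)

echo "sample.dat|$cols|$lens"   # prints: sample.dat|2,3|3,5
rm -r "$workdir"
```

Two values in the second column (2,3) flag the short record; a clean three-column file would show only 3 there.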

Command:

metafilename=metadata_stat_$(date +%s%3N).txt; echo "file_name|distinct_no_cols|distinct_no_lengths" > "$metafilename"; for filename in $(hadoop fs -ls /path/to/data/files/ | awk '{print $NF}' | grep '\.dat$'); do echo "$filename" \|$(hadoop fs -cat "$filename" | awk -F "," '{print NF}' | sort -nu | tr "\n" "," | sed 's/\(.*\),/\1 /') \|$(hadoop fs -cat "$filename" | awk '{print length($0)}' | sort -nu | tr "\n" "," | sed 's/\(.*\),/\1 /') >> "$metafilename"; done
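The one-liner above reads each file from HDFS twice, once per statistic. For large files a single-pass variant may be preferable. The following is only a sketch under that assumption: record_stats is a hypothetical helper name, not part of Hadoop, and it takes the delimiter as an optional first argument (defaulting to a comma).

```shell
#!/usr/bin/env bash
# record_stats (hypothetical helper): reads records on stdin and prints
# "<distinct field counts>|<distinct record lengths>", each comma-separated.
record_stats() {
  awk -F "${1:-,}" '
    {
      n = length($0)
      cols[NF] = 1; lens[n] = 1          # remember every value seen
      if (NF > maxc) maxc = NF
      if (n > maxl) maxl = n
    }
    END {
      # Walk the possible values in ascending order so the output is sorted.
      for (i = 0; i <= maxc; i++) if (i in cols) { c = c sc i; sc = "," }
      for (i = 0; i <= maxl; i++) if (i in lens) { l = l sl i; sl = "," }
      print c "|" l
    }'
}

# Local demonstration with made-up records.
printf '%s\n' 'a,b,c' 'd,e,f' 'g,h' | record_stats   # prints: 2,3|3,5

# Intended use inside the loop (not run here):
#   echo "$filename|$(hadoop fs -cat "$filename" | record_stats)" >> "$metafilename"
```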

Note: If your records have delimiters inside quoted strings (as mentioned in point 3), use the command below instead. It relies on FPAT, which is a GNU awk (gawk) feature, so awk must be gawk on your system.

metafilename=metadata_stat_$(date +%s%3N).txt; echo "file_name|distinct_no_cols|distinct_no_lengths" > "$metafilename"; for filename in $(hadoop fs -ls /path/to/data/files/ | awk '{print $NF}' | grep '\.dat$'); do echo "$filename" \|$(hadoop fs -cat "$filename" | awk -v FPAT='([^,]*)|("[^"]+")' '{print NF}' | sort -nu | tr "\n" "," | sed 's/\(.*\),/\1 /') \|$(hadoop fs -cat "$filename" | awk '{print length($0)}' | sort -nu | tr "\n" "," | sed 's/\(.*\),/\1 /') >> "$metafilename"; done
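The difference between the two field-splitting approaches can be seen on a single record. This is a small sketch with a made-up sample record; it calls gawk explicitly (an assumption about what is installed) and skips the FPAT half if gawk is not found.

```shell
#!/usr/bin/env bash
# Hypothetical sample record: the second field contains a comma inside quotes.
sample='1,"Smith, John",42'

# Plain comma splitting treats the quoted comma as a delimiter.
naive=$(printf '%s\n' "$sample" | awk -F ',' '{print NF}')
echo "comma split: $naive fields"    # prints: comma split: 4 fields

# FPAT describes what a field looks like instead of what separates fields.
# It is a GNU awk extension, so this part runs only when gawk is installed.
if command -v gawk >/dev/null 2>&1; then
  quoted=$(printf '%s\n' "$sample" | gawk -v FPAT='([^,]*)|("[^"]+")' '{print NF}')
  echo "FPAT split: $quoted fields"  # prints: FPAT split: 3 fields
else
  echo "gawk not found; skipping the FPAT comparison" >&2
fi
```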