The bash command below gives you metadata statistics for each raw file: the distinct column counts and the distinct record lengths found in it. This helps in the landing phase to determine whether any raw files are corrupted, especially when there are many raw files in your landing area and you need to test them quickly. (The command is hard-coded to the .dat extension; change it to match your file type.)
1. The output is pipe-delimited: a header line, file_name|distinct_no_cols|distinct_no_lengths, followed by one row per file.
2. The output is saved to a metadata file named metadata_stat_<unixtimestamp>.txt, where the timestamp is in milliseconds.
3. If distinct_no_cols has more than one value for a file, either one or more records in that file are corrupted, or some records contain quoted strings that include the delimiter itself as part of a value.
4. If your file is a fixed-length file, the distinct_no_lengths column should have only one value per file.
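Points 3 and 4 can be checked mechanically once the metadata file exists. The following is a minimal sketch, assuming the pipe-delimited layout described above; the function name flag_multi_valued and the sample rows are hypothetical and only there for illustration.

```shell
# flag_multi_valued FILE COL
# Print the name of every file whose COL-th field in the metadata file
# holds a comma-separated list, i.e. more than one distinct value.
# COL is 2 for distinct_no_cols and 3 for distinct_no_lengths.
flag_multi_valued() {
  awk -F'|' -v col="$2" '
    NR > 1 && $col ~ /,/ {
      sub(/[ \t]+$/, "", $1)   # drop the padding after the file name
      print $1
    }' "$1"
}

# Hypothetical metadata rows, for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
file_name|distinct_no_cols|distinct_no_lengths
/landing/a.dat |5 |42
/landing/b.dat |5,6 |40,47
/landing/c.dat |7 |30,31
EOF

flag_multi_valued "$sample" 2   # prints /landing/b.dat
```

Passing 3 instead of 2 lists the files with more than one record length, which is the check that matters for fixed-length files.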
command:
metafilename=metadata_stat_$(date +%s%3N).txt; echo 'file_name|distinct_no_cols|distinct_no_lengths' > "$metafilename"; for filename in $(hadoop fs -ls /path/to/data/files/ | awk '{print $NF}' | grep '\.dat$'); do echo "$filename" \|$(hadoop fs -cat "$filename" | awk -F "," '{print NF}' | sort -n | uniq | tr "\n" "," | sed 's/\(.*\),/\1 /') \|$(hadoop fs -cat "$filename" | awk '{print length($0)}' | sort -n | uniq | tr "\n" "," | sed 's/\(.*\),/\1 /') >> "$metafilename"; done
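To see what the one-liner computes for each file, it can help to run the same two statistics on a small local sample first. This is a sketch rather than a replacement for the command above; the helper names distinct_values and file_stats are hypothetical, and the comma delimiter is an assumed default.

```shell
# distinct_values: read one number per line on stdin and print the
# distinct values as a numerically sorted, comma-separated list.
distinct_values() {
  sort -n -u | paste -sd, -
}

# file_stats [DELIM]: read records on stdin and print
# "<distinct column counts>|<distinct record lengths>".
file_stats() {
  delim=${1:-,}
  tmp=$(mktemp)
  cat > "$tmp"                      # buffer stdin so it can be read twice
  cols=$(awk -F"$delim" '{ print NF }' "$tmp" | distinct_values)
  lens=$(awk '{ print length($0) }' "$tmp" | distinct_values)
  rm -f "$tmp"
  printf '%s|%s\n' "$cols" "$lens"
}

# Two 3-column records and one 2-column record.
printf 'a,b,c\nd,e,f\ng,h\n' | file_stats   # prints 2,3|3,5
```

The same function should work on a file in HDFS by piping it in, for example hadoop fs -cat "$filename" | file_stats.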
Note: If you have delimiters inside quoted strings (as mentioned in point 3), use the command below instead. It relies on FPAT, which is a GNU awk (gawk) extension, so the awk on your machine must be gawk.
metafilename=metadata_stat_$(date +%s%3N).txt; echo 'file_name|distinct_no_cols|distinct_no_lengths' > "$metafilename"; for filename in $(hadoop fs -ls /path/to/data/files/ | awk '{print $NF}' | grep '\.dat$'); do echo "$filename" \|$(hadoop fs -cat "$filename" | awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print NF}' | sort -n | uniq | tr "\n" "," | sed 's/\(.*\),/\1 /') \|$(hadoop fs -cat "$filename" | awk '{print length($0)}' | sort -n | uniq | tr "\n" "," | sed 's/\(.*\),/\1 /') >> "$metafilename"; done
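If gawk is not available, the quoted-delimiter case can be approximated with plain POSIX awk by scanning each record character by character. This is a different technique from FPAT, offered only as a sketch: the function name count_csv_fields is hypothetical, it assumes a comma delimiter and double-quote quoting, and it does not handle quoted fields that span multiple lines.

```shell
# count_csv_fields: for each record on stdin, print the number of
# comma-separated fields, ignoring commas inside double quotes.
count_csv_fields() {
  awk '{
    if ($0 == "") { print 0; next }     # an empty record has no fields
    n = 1; in_quotes = 0
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (c == "\"") in_quotes = !in_quotes
      else if (c == "," && !in_quotes) n++
    }
    print n
  }'
}

# The quoted comma in the first record is not counted as a delimiter,
# so both records report 3 fields.
printf '%s\n' '1,"Smith, John",NY' '2,Jane,CA' | count_csv_fields
```

For comparison, awk -F "," '{print NF}' reports 4 fields for the first record, which is exactly the false alarm described in point 3.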