Showing posts from August, 2020

Getting the Metadata Statistics of an HDFS file

The bash command below reports metadata statistics for raw files: the distinct column counts and the distinct record lengths found in each file. This helps in the landing phase when you need to determine whether any raw files are corrupted, especially when the landing area holds a large number of raw files and you need to test them quickly. (The command is hard-coded for the .dat extension; change it to match your file type.)

1. Sample output of the command below: [sample output image not available]

2. The output is saved to a metadata file named metadata_stat_<unixtimestamp>.

3. If a file has more than one value in distinct_no_cols, one or more of its records are corrupted, or some records contain quoted strings that include the delimiter itself as a value.

4. If your file is a fixed-length file, the distinct_no_lengths column should have only one value per file.

command:

metafilename=metadata_stat_$(date +%s%3N).txt;echo file_name\|distinct_no_cols\
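
The command above is cut off in this post, so the rest of it cannot be shown. As a rough illustration of the same idea, here is a minimal sketch that works on local files. The function name file_stats, the "|" field delimiter, and the use of local paths instead of HDFS are all assumptions made for this example, not the original command.

```shell
#!/usr/bin/env bash
# Sketch only: file_stats is a hypothetical helper, and the delimiter is
# passed in by the caller. This is not the original (truncated) command.

# Print one line per file:
#   file_name|distinct column counts|distinct record lengths
file_stats() {
  local delim="$1"; shift
  local f cols lens
  for f in "$@"; do
    # Number of fields on every record, reduced to the sorted distinct values.
    cols=$(awk -F"$delim" '{ print NF }' "$f" | sort -nu | paste -sd, -)
    # Length of every record, reduced to the sorted distinct values.
    lens=$(awk '{ print length($0) }' "$f" | sort -nu | paste -sd, -)
    printf '%s|%s|%s\n' "$f" "$cols" "$lens"
  done
}

# Example usage (assumes .dat files exist in the current directory):
#   echo 'file_name|distinct_no_cols|distinct_no_lengths'
#   file_stats '|' ./*.dat
```

A file whose second column shows more than one value (for example 2,3) has records with differing field counts, which matches the corruption check in point 3. For files stored in HDFS, the same awk pipeline could probably be fed from `hdfs dfs -cat <path>` instead of a local path, but that variant is not tested here.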