Maintaining a history table is a common requirement in almost any data engineering or data warehousing project, and there are numerous ways of modelling a history data set. Below is an SCD2 (Slowly Changing Dimension Type 2) implementation of a history data set in PySpark. There is a source data set, say SRC, with daily delta records (the day's changes), and a history data set, say HIST, holding both active and expired records.

In HIST:

- Every record has START_DT and END_DT (an ACTIVE_FLAG can be added if needed, but I usually ignore it since END_DT is sufficient from my perspective).
- END_DT = '9999-12-31' (or null, or any other high default date value) represents an active record, and END_DT = <past date value> represents an expired record.
- START_DT gives the starting date of that record.
- The HIST data set is partitioned on END_DT, since every day we are interested only in the END_DT = '9999-12-31' records, so the other partitions are not touched at all.

In SRC, we have only the daily delta records, i.e. the new and changed rows for that day.
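A minimal sketch of the daily SCD2 merge described above, assuming a single business key column ID, string dates in yyyy-MM-dd format, SRC carrying the same business columns as HIST (minus the two date columns), and registered tables named SRC and HIST; the real key columns, change detection and write strategy will differ per project:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

    HIGH_DATE = "9999-12-31"      # default END_DT of active records
    load_dt = "2024-01-15"        # assumed business date of this delta load

    src = spark.table("SRC")      # daily delta, assumed one row per ID
    hist_active = spark.table("HIST").where(F.col("END_DT") == HIGH_DATE)

    # 1. Active records whose key arrives in today's delta are expired:
    #    END_DT becomes the day before the load date.
    expired = (hist_active
               .join(src.select("ID"), on="ID", how="inner")
               .withColumn("END_DT",
                           F.date_format(F.date_sub(F.to_date(F.lit(load_dt)), 1),
                                         "yyyy-MM-dd")))

    # 2. Active records not present in the delta stay open, untouched.
    unchanged = hist_active.join(src.select("ID"), on="ID", how="left_anti")

    # 3. Every delta record becomes the new active version.
    new_active = (src
                  .withColumn("START_DT", F.lit(load_dt))
                  .withColumn("END_DT", F.lit(HIGH_DATE)))

    # Only the END_DT = '9999-12-31' partition plus the newly created expired
    # partition change; writing to a staging table avoids reading and
    # overwriting HIST in the same job.
    new_hist = unchanged.unionByName(new_active).unionByName(expired)
    new_hist.write.mode("overwrite").partitionBy("END_DT").saveAsTable("HIST_STG")

Change detection (comparing attribute values before expiring a record) and late-arriving data are deliberately left out of this sketch.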
The bash command below gives you metadata statistics such as the distinct number of columns and the distinct record lengths in a raw file. This helps during the landing phase to determine whether any raw files are corrupted, and it is especially useful when there are a large number of raw files in your landing area and you need to do a quick check of them. (The command is hard-coded for the .dat extension; please change it according to your file type.)

1. Sample output of the command is as follows:
2. The output is saved to a metadata file named metadata_stat_<unixtimestamp>.
3. If there is more than one value of distinct_no_cols for a file, it means one or more records of that file are corrupted, or some quoted strings in the records contain the delimiter itself as a value.
4. If your file is a fixed-length file, the distinct_no_lengths column should have only one value per file.

command: metafilename=metadata_stat_$(date +%s%3N).txt;echo file_name\|distinct_no_cols\
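A sketch of the kind of check described above, assuming pipe-delimited .dat files in the current directory and an output header of file_name|distinct_no_cols|distinct_no_lengths (the delimiter, the directory and the exact column names are assumptions):

    # collect per-file column-count and record-length statistics
    metafilename=metadata_stat_$(date +%s%3N).txt
    echo "file_name|distinct_no_cols|distinct_no_lengths" > "$metafilename"
    for f in *.dat; do
      # distinct field counts per record, with '|' as the delimiter
      cols=$(awk -F'|' '{print NF}' "$f" | sort -nu | paste -sd, -)
      # distinct record lengths
      lens=$(awk '{print length($0)}' "$f" | sort -nu | paste -sd, -)
      echo "$f|$cols|$lens" >> "$metafilename"
    done

Each output row then carries the distinct values found in that file as comma-separated lists, so a healthy delimited file shows exactly one value under distinct_no_cols, and a healthy fixed-length file exactly one value under distinct_no_lengths.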