
PySpark: Maintaining a history data set in the SCD2 (Slowly Changing Dimension Type 2) model

Maintaining a history table is a common scenario in any data engineering or data warehousing project. There are numerous ways of modelling a history data set; below is an SCD2 implementation of a history data set in PySpark (a sketch of the daily merge follows after the setup). There will be a source data set, say SRC, with daily delta records (the daily changes), and a history data set, say HIST, holding both active and expired records.

In HIST:

- Every record has a START_DT and an END_DT (plus an ACTIVE_FLAG if needed, but I usually leave it out since END_DT is sufficient from my perspective).
- END_DT = '9999-12-31' (or null, or any high default date value) marks an active record; END_DT = <past date value> marks an expired record.
- START_DT gives the starting date of that record.
- The HIST data set is partitioned on END_DT, since each day we are interested only in the END_DT = '9999-12-31' records, so the other partitions are never touched.

In SRC, we have:
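Putting the setup above together, here is a minimal sketch of the daily SCD2 merge. The table names (hist, src), the business key (id), and the tracked attributes (name, city) are placeholders for illustration, and SRC is assumed to carry one row per key with its latest attribute values; the write back to HIST is left out.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2_history_merge").getOrCreate()

HIGH_DT = "9999-12-31"                # sentinel END_DT for active records
today = F.current_date()

# Read only the active partition of HIST; expired partitions stay untouched.
hist = spark.read.table("hist").where(F.col("END_DT") == F.lit(HIGH_DT))
src = spark.read.table("src")         # today's delta records

key = ["id"]                          # business key (assumed)
attrs = ["name", "city"]              # tracked attributes (assumed)

join_cond = [F.col(f"h.{k}") == F.col(f"s.{k}") for k in key]
# NOTE: '<>' is not NULL-safe; changes to/from NULL need eqNullSafe-style checks.
diff_expr = " OR ".join(f"h.{c} <> s.{c}" for c in attrs)

# Active HIST rows whose attributes changed in today's delta.
changed = (hist.alias("h")
               .join(src.alias("s"), join_cond, "inner")
               .where(diff_expr))

# 1. Expire the changed rows: close them out as of yesterday.
expired = changed.select("h.*").withColumn("END_DT", F.date_sub(today, 1))

# 2. Open new versions: changed keys (with new values) plus brand-new keys.
new_versions = (changed.select("s.*")
                       .unionByName(src.join(hist, key, "left_anti"))
                       .withColumn("START_DT", today)
                       .withColumn("END_DT", F.lit(HIGH_DT)))

# 3. Carry over active rows untouched by today's delta.
unchanged = hist.join(changed.select([F.col(f"h.{k}").alias(k) for k in key]),
                      key, "left_anti")

# New active records plus the newly expired ones, ready to be written back.
result = unchanged.unionByName(new_versions).unionByName(expired)
```

Because HIST is partitioned on END_DT, writing result back with dynamic partition overwrite touches only the '9999-12-31' partition and the single new partition for yesterday's date, which is exactly the property the partitioning choice above is after.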