
PySpark: Maintaining a history data set in the SCD2 (Slowly Changing Dimension Type 2) model

Maintaining a history table is a common scenario in any data engineering or data warehousing project. There are numerous ways of modelling a history data set; below is an SCD2 implementation of a history data set in PySpark (a sketch of the daily merge follows after the setup). There will be a source data set, say SRC, with daily delta records (the daily changes), and a history data set, say HIST, holding both active and expired records.

In HIST:

- Every record has a START_DT and an END_DT (plus an ACTIVE_FLAG if needed, but I usually leave it out since END_DT is sufficient from my perspective).
- END_DT = '9999-12-31' (or null, or any high default date value) marks an active record; END_DT = <past date value> marks an expired record.
- START_DT gives the starting date of that record.
- The HIST data set is partitioned on END_DT, since each day we are interested only in the END_DT = '9999-12-31' records, so the other partitions are never touched.

In SRC, we have:
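Putting the setup above together, here is a minimal sketch of the daily SCD2 merge. The table names (hist, src), the business key (id), and the tracked attributes (name, city) are placeholders for illustration, and SRC is assumed to carry one row per key with its latest attribute values; the write back to HIST is left out.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2_history_merge").getOrCreate()

HIGH_DT = "9999-12-31"                # sentinel END_DT for active records
today = F.current_date()

# Read only the active partition of HIST; expired partitions stay untouched.
hist = spark.read.table("hist").where(F.col("END_DT") == F.lit(HIGH_DT))
src = spark.read.table("src")         # today's delta records

key = ["id"]                          # business key (assumed)
attrs = ["name", "city"]              # tracked attributes (assumed)

join_cond = [F.col(f"h.{k}") == F.col(f"s.{k}") for k in key]
# NOTE: '<>' is not NULL-safe; changes to/from NULL need eqNullSafe-style checks.
diff_expr = " OR ".join(f"h.{c} <> s.{c}" for c in attrs)

# Active HIST rows whose attributes changed in today's delta.
changed = (hist.alias("h")
               .join(src.alias("s"), join_cond, "inner")
               .where(diff_expr))

# 1. Expire the changed rows: close them out as of yesterday.
expired = changed.select("h.*").withColumn("END_DT", F.date_sub(today, 1))

# 2. Open new versions: changed keys (with new values) plus brand-new keys.
new_versions = (changed.select("s.*")
                       .unionByName(src.join(hist, key, "left_anti"))
                       .withColumn("START_DT", today)
                       .withColumn("END_DT", F.lit(HIGH_DT)))

# 3. Carry over active rows untouched by today's delta.
unchanged = hist.join(changed.select([F.col(f"h.{k}").alias(k) for k in key]),
                      key, "left_anti")

# New active records plus the newly expired ones, ready to be written back.
result = unchanged.unionByName(new_versions).unionByName(expired)
```

Because HIST is partitioned on END_DT, writing result back with dynamic partition overwrite touches only the '9999-12-31' partition and the single new partition for yesterday's date, which is exactly the property the partitioning choice above is after.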