渐变维度 使用 Apache Hudi 实现 SCD-2( 五 )

  1. 现在我们有一个DataFrame,它在一条记录中包含新旧数据,让我们在各自单独的DataFrame中拉取更新记录的活动和非活动实例 。


在进行上述练习时,我们将通过更改活动(新)记录的 eff_end_tsto eff_start_ts -1 并更新 actv_ind = 0 来废弃非活动记录
//Prepare Active updatesval updActiveDf = updDf.select(col("stg_seller_id").as("seller_id"),col("stg_prod_category").as("prod_category"),col("stg_product_name").as("product_name"),col("stg_product_package").as("product_package"),col("stg_discount_percentage").as("discount_percentage"),col("stg_eff_start_ts").as("eff_start_ts"),to_timestamp(lit("9999-12-31 23:59:59")) as ("eff_end_ts"),lit(1) as ("actv_ind"))updActiveDf.show(false)+---------+-------------+-----------------+---------------+-------------------+-------------------+-------------------+--------+|seller_id|prod_category|product_name     |product_package|discount_percentage|eff_start_ts       |eff_end_ts         |actv_ind|+---------+-------------+-----------------+---------------+-------------------+-------------------+-------------------+--------+|1234     |Detergent    |Tide 5L          |6              |25                 |2022-01-31 10:00:30|9999-12-31 23:59:59|1       ||4565     |Gourmet      |Dairy Milk Almond|12             |45                 |2022-06-12 20:30:40|9999-12-31 23:59:59|1       |+---------+-------------+-----------------+---------------+-------------------+-------------------+-------------------+--------+//Prepare inactive updates, which will become obsolete recordsval updInactiveDf = updDf.select(col("tgt_seller_id").as("seller_id"),col("tgt_prod_category").as("prod_category"),col("tgt_product_name").as("product_name"),col("tgt_product_package").as("product_package"),col("tgt_discount_percentage").as("discount_percentage"),col("tgt_eff_start_ts").as("eff_start_ts"),(col("stg_eff_start_ts") - expr("interval 1 seconds")).as("eff_end_ts"),lit(0) as ("actv_ind"))scala> updInactiveDf.show+---------+-------------+---------------+---------------+-------------------+-------------------+-------------------+--------+|seller_id|prod_category|   product_name|product_package|discount_percentage|       eff_start_ts|         eff_end_ts|actv_ind|+---------+-------------+---------------+---------------+-------------------+-------------------+-------------------+--------+|     1234|    Detergent|        Tide 2L|              6|                 15|2021-12-15 15:20:30|2022-01-31 10:00:29|       0||     4565|      Gourmet|Dairy Milk Silk|              6|                 30|2021-06-12 20:30:40|2022-06-12 20:30:39|       0|+---------+-------------+---------------+---------------+-------------------+-------------------+-------------------+--------+

经验总结扩展阅读