How do you handle duplicate records in Delta Lake?
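There are several approaches, depending on whether the duplicates are already in the target table or inside the incoming batch:
1) **INSERT with NOT EXISTS**: `INSERT INTO target SELECT * FROM source WHERE NOT EXISTS (SELECT 1 FROM target WHERE ...)` skips source rows whose key is already in the target. It does not remove duplicates within the source itself.
2) **MERGE with WHEN NOT MATCHED**: a `MERGE` that has only a `WHEN NOT MATCHED THEN INSERT` clause inserts new records and leaves matched rows untouched. Deduplicate the source first, because two unmatched source rows with the same key are both inserted.
3) **Window functions**: compute `ROW_NUMBER() OVER (PARTITION BY key ORDER BY timestamp DESC)` in a subquery and keep only the rows where it equals 1, which retains the latest record per key.
4) **dropDuplicates()** in PySpark: `df.dropDuplicates(["key"])` keeps one row per key, but which row survives is not guaranteed, so use the window-function approach when the latest record must win.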
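The `NOT EXISTS` and `ROW_NUMBER()` approaches are plain SQL, so they can be tried without a Spark cluster. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Spark SQL (an assumption made only so the example runs anywhere; it requires SQLite 3.25 or newer for window functions). The table layout and the column names `id`, `name`, and `updated_at` are hypothetical. `MERGE` and `dropDuplicates()` are specific to Delta Lake and Spark and are not shown here.

```python
import sqlite3

# In-memory SQLite database standing in for a Spark SQL session (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (id INTEGER, name TEXT, updated_at TEXT);
    CREATE TABLE source (id INTEGER, name TEXT, updated_at TEXT);

    INSERT INTO target VALUES (1, 'alice', '2024-01-01');

    INSERT INTO source VALUES
        (1, 'alice',   '2024-01-01'),
        (2, 'bob',     '2024-01-01'),
        (2, 'bob_new', '2024-01-02');
""")

# Step 1: ROW_NUMBER() ranks rows per id, newest first; rn = 1 keeps the latest.
# Step 2: NOT EXISTS skips ids that are already present in the target.
conn.execute("""
    INSERT INTO target (id, name, updated_at)
    SELECT id, name, updated_at
    FROM (
        SELECT id, name, updated_at,
               ROW_NUMBER() OVER (
                   PARTITION BY id ORDER BY updated_at DESC
               ) AS rn
        FROM source
    ) AS s
    WHERE s.rn = 1
      AND NOT EXISTS (SELECT 1 FROM target AS t WHERE t.id = s.id)
""")

rows = conn.execute(
    "SELECT id, name, updated_at FROM target ORDER BY id"
).fetchall()
print(rows)
```

The source batch contains one row that already exists in the target (id 1) and two versions of id 2. After the insert, the target should hold one row per id, with the newer version of id 2. The same `SELECT` shape should carry over to Spark SQL, though that has not been verified against a Delta table here.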