Load Slowly Changing Dimension Type 2 using Oracle Merge Statement
By using Oracle Merge statement, we are able to perform an insert and update statements (sometime referred to as “upsert”) in one query. The merge statement primarily useful in data warehousing situations, especially when maintaining Type 2 Slowly changing dimension (SCD2), where large amounts of data are commonly inserted and updated
1. Tables
To explore how the merge statement works, let’s use the following set of tables
Product dimension table: dim_product.
CREATE TABLE DIM_PRODUCT
(
PRODUCT_SID NUMBER NOT NULL,
PRODUCT_ID NUMBER NOT NULL,
PRODUCT_CATEGORY_ID NUMBER,
PRODUCT_NAME VARCHAR2(50 CHAR),
PRODUCT_DESCRIPTION VARCHAR2(200 CHAR),
PRICE NUMBER(8,2),
PRICE_EFFECTIVE_DATE DATE,
LAST_MODIFIED_DATE DATE,
EFFECTIVE_START_DATE DATE,
EFFECTIVE_END_DATE DATE,
IS_CURRENT_ROW VARCHAR2(1 CHAR)
CREATED_DATE DATE,
UPDATED_DATE DATE
);
Source stage table: stg_product.
This is a stage table use to store the newly modified product records from the product table.
CREATE TABLE STG_PRODUCT
(
PRODUCT_ID NUMBER NOT NULL,
PRODUCT_CATEGORY_ID NUMBER,
PRODUCT_NAME VARCHAR2(50 CHAR),
PRODUCT_DESCRIPTION VARCHAR2(200 CHAR),
PRICE NUMBER(8,2),
PRICE_EFFECTIVE_DATE DATE,
LAST_MODIFIED_DATE DATE,
CREATED_DATE DATE
);
Let’s also created a sequence to use for the dim_product dimension table ‘s surrogate keys.
CREATE SEQUENCE s_dim_product
START WITH 1
MAXVALUE 9999999999999999999999999999
MINVALUE 1
NOCYCLE
CACHE 20
NOORDER;
2. SCD Logics and the Merge Query
A slowly changing dimension type 2 (SCD2) is defined to keep track changes of attributes in the underlying source table. While there are many attributes in a table, the business rules decide what changes to track. Here, for demonstration purpose, let’s choose to keep track of changes of each attribute in the underlying product source table, which is to keep history of production name, product description, and product category and price changes.
The merger statement provides the abilities to perform insert and update with only one select statement based upon a join condition. Rows already in the target table that match the join condition are updated; otherwise, a row is inserted using values from a separate subquery. The details of the merge statement can be found from the
oracle documentation online.
http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_9016.htm
An SCD2 normally has three additional columns. The data effective start and end date of the record, and a flag to indicate the current record. To maintain histories, a new record is created to store the changing attributes while we keep the exiting records as histories. To help achieve that, a small utility table with only tow records in it is created. In this utility table, one record indicates the insert flag and the other indicates the update flag. Here the table structure:
Utility table: scd_row_type.
CREATE TABLE SCD_ROW_TYPE
(
SCD_ROW_TYPE_ID NUMBER NOT NULL,
SCD_ROW_TYPE_DESCRIPTION VARCHAR2(20 CHAR)
);
After creating the table, insert the following records into the table:
Insert into scd_row_type(scd_row_type_id,scd_row_description) values(1,’For Insert’);
Insert into scd_row_type(scd_row_type_id,scd_row_description) values(2,’For Update’);
Now we are ready to construct the merge query
MERGE INTO dim_product p
USING ( SELECT DECODE(s.scd_row_type_id,1,-6789,m.product_sid) as product_sid,
PRODUCT_ID,
PRODUCT_CATEGORY_ID,
PRODUCT_NAME,
PRODUCT_DESCRIPTION,
PRICE,
PRICE_EFFECTIVE_DATE,
LAST_MODIFIED_DATE,
CREATED_DATE,
m.scd_row_type_id
FROM (SELECT dp.product_sid,
sp.PRODUCT_ID,
sp.PRODUCT_CATEGORY_ID,
sp.PRODUCT_NAME,
sp.PRODUCT_DESCRIPTION,
sp.PRICE,
sp.PRICE_EFFECTIVE_DATE,
sp.LAST_MODIFIED_DATE,
sp.CREATED_DATE,
CASE
WHEN dp.product_id IS NULL
THEN
1
WHEN (dp.product_category_id !=
sp.product_category_id
OR dp.product_name != sp.product_name
OR DP.PRODUCT_DESCRIPTION !=
sp.product_description
OR dp.price != sp.price
OR dp.price_effective_date !=
sp.price_effective_date)
THEN
2
ELSE
0
END
AS scd_row_type_id
FROM stg_product sp
LEFT JOIN
Dim_product dp
ON (sp.product_id = dp.product_id and dp.is_current_row = 'Y')
) m
JOIN scd_row_type s
ON (s.scd_row_type_id <= m.scd_row_type_id)
) mp
ON (p.product_sid = mp.product_sid)
when matched then
update set P.EFFECTIVE_END_DATE = mp.LAST_MODIFIED_DATE, is_current_row = 'N', updated_date = sysdate
when NOT matched then
insert (P.PRODUCT_SID,P.PRODUCT_ID,P.PRODUCT_CATEGORY_ID, P.PRODUCT_NAME, P.PRODUCT_DESCRIPTION, P.PRICE, p.PRICE_EFFECTIVE_DATE,P.LAST_MODIFIED_DATE,
p.effective_start_date,P.EFFECTIVE_END_DATE,is_current_row, created_date,
updated_date )
values (s_dim_product.nextval,mp.PRODUCT_ID,mp.PRODUCT_CATEGORY_ID,mp.PRODUCT_NAME,
mp.PRODUCT_DESCRIPTION, mp.PRICE,mp.price_effective_date,
mp.LAST_MODIFIED_DATE,mp.last_modified_date,
to_date('2099-12-31 00:00:00','YYYY-MM-DD HH24:MI:SS'),’Y', sysdate,sysdate);
commit;
3. How it works
To make it easier to understand how the merge statement work when we actually perform some tests with data. First, let’s issue the following insert statements to create two records in the stage table.
truncate table stg_product;
insert into stg_product ( PRODUCT_ID, PRODUCT_CATEGORY_ID, PRODUCT_NAME,PRODUCT_DESCRIPTION,PRICE,PRICE_EFFECTIVE_DATE, LAST_MODIFIED_DATE, CREATED_DATE)
values (1,8,'Apple iPad 36 GB','Apple iPad 36 GB with case', 800,sysdate,sysdate,sysdate);
insert into stg_product ( PRODUCT_ID, PRODUCT_CATEGORY_ID, PRODUCT_NAME,PRODUCT_DESCRIPTION,PRICE,PRICE_EFFECTIVE_DATE, LAST_MODIFIED_DATE,CREATED_DATE)
values (2,7,'Canon 12 MPIX Digital Camera','Canon 12 MPIX Digital Camera, Zoon Len, Case', 150,sysdate,sysdate,sysdate);
commit;
Check the stage table to make sure the records are there.
Sqlplus>select * from stg_product;
Now run the merge statement which was created in section 2. Once the merge query statement was executed, run the following query to checked the records in the dim_product table.
Sqlplus>select * from dim_product;
There should be two rows in the dimension table with the column effective_start_date, effective_end_date and is_current_row populated. Now, let’s find out how the merge statement handles SCD Type 2 records. To begin, run the following queries against the stage table:
update stg_product set price = 700 , price_effective_date = sysdate, last_modified_date = sysdate where product_id = 1;
insert into stg_product (
PRODUCT_ID,
PRODUCT_CATEGORY_ID,
PRODUCT_NAME,
PRODUCT_DESCRIPTION,
PRICE,
PRICE_EFFECTIVE_DATE,
LAST_MODIFIED_DATE,
CREATED_DATE)
values (3,6,'EPSon Color Printer',
'EPSon color print with high definition photo prints', 200,sysdate,sysdate,sysdate);
commit;
The above statements updated a record where product_id equal to 1 to change it price to 700, and inserted one more new record, product_id 3, into the stg_product table. To help understand what the records set looks like from the using clause of the merge statement, let’s copy the query and execute it to get the record set.
SELECT DECODE(s.scd_row_type_id,1,-6789,m.product_sid) as product_sid,
PRODUCT_ID,
PRODUCT_CATEGORY_ID,
PRODUCT_NAME,
PRODUCT_DESCRIPTION,
PRICE,
PRICE_EFFECTIVE_DATE,
LAST_MODIFIED_DATE,
CREATED_DATE,
m.scd_row_type_id,
s.scd_row_type_id
FROM (SELECT dp.product_sid,
sp.PRODUCT_ID,
sp.PRODUCT_CATEGORY_ID,
sp.PRODUCT_NAME,
sp.PRODUCT_DESCRIPTION,
sp.PRICE,
sp.PRICE_EFFECTIVE_DATE,
sp.LAST_MODIFIED_DATE,
sp.CREATED_DATE,
CASE
WHEN dp.product_id IS NULL
THEN
1
WHEN (dp.product_category_id !=
sp.product_category_id
OR dp.product_name != sp.product_name
OR DP.PRODUCT_DESCRIPTION !=
sp.product_description
OR dp.price != sp.price
OR dp.price_effective_date !=
sp.price_effective_date)
THEN
2
ELSE
0
END
AS scd_row_type_id
FROM stg_product sp
LEFT JOIN
Dim_product dp
ON (sp.product_id = dp.product_id and dp.is_current_row = 'Y')
) m
JOIN scd_row_type s
ON (s.scd_row_type_id <= m.scd_row_type_id)
Examine the result set, it should look something like:
PRODUCT_SID PRODUCT_ID PRODUCT_CATEGORY_ID PRODUCT_NAME PRODUCT_DESCRIPTION PRICE PRICE_EFFECTIVE_DATE LAST_MODIFIED_DATE CREATED_DATE SCD_ROW_TYPE_ID
-6789 1 8 Apple iPad 36 GB Apple iPad 36 GB with the latest createst 700 4/19/2011 2:11 4/19/2011 2:11 4/19/2011 2:10 2
62 1 8 Apple iPad 36 GB Apple iPad 36 GB with the latest createst 700 4/19/2011 2:11 4/19/2011 2:11 4/19/2011 2:10 2
3 6 EPSon Color Printer EPSon color print with high definition photo prints 200 4/19/2011 2:11 4/19/2011 2:11 4/19/2011 2:11 1
As you can see, we have two rows for product_id 1, and one row for the new record, product_id 3. Also, at the end of the row, you can also see that the scd_record_type_id column has two ids for product id 1. One is 2 and the other one is 1. for the one the that has a scd_record_type_id equql to 1, the product_sid is changed to –6789. Therefore, this record will be used for insert. The logic that define which record to update or insert was defined in the case statement:
(CASE
WHEN dp.product_id IS NULL
THEN 1
WHEN (dp.product_category_id != sp.product_category_id
OR dp.product_name != sp.product_name
OR DP.PRODUCT_DESCRIPTION != sp.product_description
OR dp.price != sp.price
OR dp.price_effective_date != sp.price_effective_date)
THEN 2
ELSE 0
END) AS scd_row_type_id
If the product_id is not existing in the dim_product table, it is a new record. Therefore, it is assigned a 1 and it will be inserted. If any of the changes occurred to an existing record, it is an update and is assigned a 2. Otherwise, any identical record will be assign a 0. The join statement:
JOIN scd_row_type s
ON (s.scd_row_type_id <= m.scd_row_type_id)
takes only record with a scd_record_type_id that is 1 or greater since the scd_record_type table only have two records, scd_record_type_id 1 and 2. Therefore, any record that with a scd_record_type = 0 (identically records) will not be included in the final result set. To create a type 2 record, which is a existing record that has one or more new attributes, we change the existing product_sid to a dummy number, -6789, by using the decode statement:
DECODE(s.scd_row_type_id,1,-6789,m.product_sid) as product_sid
We choose this number because we had define the surrogate key of the dim_product table to only use positive value for keys. Hence, it will be inserted into the dimension. The "when matched" and "when not matched" clauses in the merge statement are self-explanatory. We matched the record set with the dim_product table on the product_sid key. If it matched, we set the is_current_row flag to ‘N’ and sent the effective_end_date to the last_modified_date. If it was not matched, records will be insert into the dimension table. Now, run the merge statement in section 2 to load the data into the dimension table.
4. Summary
As it is stated in the
Oracle documentation, the
MERGE statement can be used to select rows from one or more sources for update or insertion into a table or view. You can specify conditions to determine whether to update or insert into the target table or view.
This statement is a convenient way to combine multiple operations. It lets you avoid multiple INSERT, UPDATE, and DELETE DML statements. Therefore, it is helpful in loading data into a slowly changing dimension type 2 table.
MERGE is a deterministic statement. You cannot update the same row of the target table multiple times in the same MERGE statement. Oracle Database does not implement fine-grained access control during MERGE statements.
If you are using the fine-grained access control feature on the target table or tables, then use equivalent INSERT and UPDATE statements instead of MERGE to avoid error messages and to ensure correct access control.
Reference:
Oracle Online Documentations
Oracle Database 11g The Complete Reference,Kevin Loney, Oracle Press
The Data Warehouse toolkit, Ralph Kimball and Margy Ross
The Data Warehouse ETL Toolkit, Ralph Kimball and Joe Caserta
Other Reference on the Web
Related Note:
SQL Outer Join