Wednesday, April 20, 2011

Load Slowly Changing Dimension Type 2 using Oracle Merge Statement

 Load Slowly Changing Dimension Type 2 using Oracle Merge Statement

By using Oracle Merge statement, we are able to perform an insert and update statements (sometime referred to as “upsert”) in one query. The merge statement primarily useful in data warehousing situations, especially when maintaining Type 2 Slowly changing dimension (SCD2), where large amounts of data are commonly inserted and updated

1. Tables

To explore how the merge statement works, let’s use the following set of tables

Product dimension table: dim_product.


CREATE TABLE DIM_PRODUCT
(
PRODUCT_SID NUMBER NOT NULL,
PRODUCT_ID NUMBER NOT NULL,
PRODUCT_CATEGORY_ID NUMBER,
PRODUCT_NAME VARCHAR2(50 CHAR),
PRODUCT_DESCRIPTION VARCHAR2(200 CHAR),
PRICE NUMBER(8,2),
PRICE_EFFECTIVE_DATE DATE,
LAST_MODIFIED_DATE DATE,
EFFECTIVE_START_DATE DATE,
EFFECTIVE_END_DATE DATE,
IS_CURRENT_ROW VARCHAR2(1 CHAR)
CREATED_DATE DATE,
UPDATED_DATE DATE
);

Source stage table: stg_product.
This is a stage table use to store the newly modified product records from the product table.

CREATE TABLE STG_PRODUCT
(
PRODUCT_ID NUMBER NOT NULL,
PRODUCT_CATEGORY_ID NUMBER,
PRODUCT_NAME VARCHAR2(50 CHAR),
PRODUCT_DESCRIPTION VARCHAR2(200 CHAR),
PRICE NUMBER(8,2),
PRICE_EFFECTIVE_DATE DATE,
LAST_MODIFIED_DATE DATE,
CREATED_DATE DATE
);

Let’s also created a sequence to use for the dim_product dimension table ‘s surrogate keys.

CREATE SEQUENCE s_dim_product
START WITH 1
MAXVALUE 9999999999999999999999999999
MINVALUE 1
NOCYCLE
CACHE 20
NOORDER;

2. SCD Logics and the Merge Query

A slowly changing dimension type 2 (SCD2) is defined to keep track changes of attributes in the underlying source table. While there are many attributes in a table, the business rules decide what changes to track. Here, for demonstration purpose, let’s choose to keep track of changes of each attribute in the underlying product source table, which is to keep history of production name, product description, and product category and price changes.

The merger statement provides the abilities to perform insert and update with only one select statement based upon a join condition. Rows already in the target table that match the join condition are updated; otherwise, a row is inserted using values from a separate subquery. The details of the merge statement can be found from the oracle documentation online.

http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_9016.htm

An SCD2 normally has three additional columns. The data effective start and end date of the record, and a flag to indicate the current record. To maintain histories, a new record is created to store the changing attributes while we keep the exiting records as histories. To help achieve that, a small utility table with only tow records in it is created. In this utility table, one record indicates the insert flag and the other indicates the update flag. Here the table structure:

Utility table: scd_row_type.

CREATE TABLE SCD_ROW_TYPE
(
SCD_ROW_TYPE_ID NUMBER NOT NULL,
SCD_ROW_TYPE_DESCRIPTION VARCHAR2(20 CHAR)
);

After creating the table, insert the following records into the table:

Insert into scd_row_type(scd_row_type_id,scd_row_description) values(1,’For Insert’);
Insert into scd_row_type(scd_row_type_id,scd_row_description) values(2,’For Update’);

Now we are ready to construct the merge query
MERGE INTO dim_product p
USING ( SELECT DECODE(s.scd_row_type_id,1,-6789,m.product_sid) as product_sid,
                   PRODUCT_ID,
                   PRODUCT_CATEGORY_ID,
                   PRODUCT_NAME,
                   PRODUCT_DESCRIPTION,
                   PRICE,
                   PRICE_EFFECTIVE_DATE,
                   LAST_MODIFIED_DATE,
                   CREATED_DATE,
                   m.scd_row_type_id
              FROM    (SELECT dp.product_sid,
                              sp.PRODUCT_ID,
                              sp.PRODUCT_CATEGORY_ID,
                              sp.PRODUCT_NAME,
                              sp.PRODUCT_DESCRIPTION,
                              sp.PRICE,
                              sp.PRICE_EFFECTIVE_DATE,
                              sp.LAST_MODIFIED_DATE,
                              sp.CREATED_DATE,
                              CASE
                                 WHEN dp.product_id IS NULL
                                 THEN
                                    1
                                 WHEN (dp.product_category_id !=
                                          sp.product_category_id
                                       OR dp.product_name != sp.product_name
                                       OR DP.PRODUCT_DESCRIPTION !=
                                             sp.product_description
                                       OR dp.price != sp.price
                                       OR dp.price_effective_date !=
                                             sp.price_effective_date)
                                 THEN
                                    2
                                 ELSE
                                    0
                              END
                                 AS scd_row_type_id
                         FROM    stg_product sp
                              LEFT JOIN
                                 Dim_product dp
                              ON (sp.product_id = dp.product_id and  dp.is_current_row = 'Y')
                     ) m
                   JOIN  scd_row_type s
                   ON (s.scd_row_type_id <= m.scd_row_type_id)
                   ) mp
        ON  (p.product_sid = mp.product_sid)
when matched then
   update set P.EFFECTIVE_END_DATE = mp.LAST_MODIFIED_DATE, is_current_row = 'N', updated_date = sysdate
when NOT matched then
insert (P.PRODUCT_SID,P.PRODUCT_ID,P.PRODUCT_CATEGORY_ID, P.PRODUCT_NAME, P.PRODUCT_DESCRIPTION, P.PRICE, p.PRICE_EFFECTIVE_DATE,P.LAST_MODIFIED_DATE,
 p.effective_start_date,P.EFFECTIVE_END_DATE,is_current_row, created_date,
 updated_date )
 values (s_dim_product.nextval,mp.PRODUCT_ID,mp.PRODUCT_CATEGORY_ID,mp.PRODUCT_NAME,
 mp.PRODUCT_DESCRIPTION, mp.PRICE,mp.price_effective_date,
mp.LAST_MODIFIED_DATE,mp.last_modified_date,
 to_date('2099-12-31 00:00:00','YYYY-MM-DD HH24:MI:SS'),’Y', sysdate,sysdate);
  commit;


3. How it works

To make it easier to understand how the merge statement work when we actually perform some tests with data. First, let’s issue the following insert statements to create two records in the stage table.

truncate table stg_product;

insert into stg_product ( PRODUCT_ID, PRODUCT_CATEGORY_ID,  PRODUCT_NAME,PRODUCT_DESCRIPTION,PRICE,PRICE_EFFECTIVE_DATE,  LAST_MODIFIED_DATE,  CREATED_DATE)
        values (1,8,'Apple iPad 36 GB','Apple iPad 36 GB with case', 800,sysdate,sysdate,sysdate);

insert into stg_product ( PRODUCT_ID, PRODUCT_CATEGORY_ID,  PRODUCT_NAME,PRODUCT_DESCRIPTION,PRICE,PRICE_EFFECTIVE_DATE,  LAST_MODIFIED_DATE,CREATED_DATE)
         values (2,7,'Canon 12 MPIX Digital Camera','Canon 12 MPIX Digital Camera, Zoon Len, Case', 150,sysdate,sysdate,sysdate);

commit;


Check the stage table to make sure the records are there.

Sqlplus>select * from stg_product;

Now run the merge statement which was created in section 2. Once the merge query statement was executed, run the following query to checked the records in the dim_product table.

Sqlplus>select * from dim_product;

There should be two rows in the dimension table with the column effective_start_date, effective_end_date and is_current_row populated. Now, let’s find out how the merge statement handles SCD Type 2 records. To begin, run the following queries against the stage table:

update stg_product set price = 700 , price_effective_date = sysdate, last_modified_date = sysdate where product_id = 1;

insert into stg_product (
PRODUCT_ID, 
PRODUCT_CATEGORY_ID,
PRODUCT_NAME,
PRODUCT_DESCRIPTION,
PRICE,
PRICE_EFFECTIVE_DATE,
LAST_MODIFIED_DATE,
CREATED_DATE)
values (3,6,'EPSon Color Printer',
'EPSon color print with high definition photo prints', 200,sysdate,sysdate,sysdate);
 commit;

The above statements updated a record where product_id equal to 1 to change it price to 700, and inserted one more new record, product_id 3, into the stg_product table. To help understand what the records set looks like from the using clause of the merge statement, let’s copy the query and execute it to get the record set.

SELECT DECODE(s.scd_row_type_id,1,-6789,m.product_sid) as product_sid,
                   PRODUCT_ID,
                   PRODUCT_CATEGORY_ID,
                   PRODUCT_NAME,
                   PRODUCT_DESCRIPTION,
                   PRICE,
                   PRICE_EFFECTIVE_DATE,
                   LAST_MODIFIED_DATE,
                   CREATED_DATE,
                   m.scd_row_type_id,
         s.scd_row_type_id
              FROM    (SELECT dp.product_sid,
                              sp.PRODUCT_ID,
                              sp.PRODUCT_CATEGORY_ID,
                              sp.PRODUCT_NAME,
                              sp.PRODUCT_DESCRIPTION,
                              sp.PRICE,
                              sp.PRICE_EFFECTIVE_DATE,
                              sp.LAST_MODIFIED_DATE,
                              sp.CREATED_DATE,
                              CASE
                                 WHEN dp.product_id IS NULL
                                 THEN
                                    1
                                 WHEN (dp.product_category_id !=
                                          sp.product_category_id
                                       OR dp.product_name != sp.product_name
                                       OR DP.PRODUCT_DESCRIPTION !=
                                             sp.product_description
                                       OR dp.price != sp.price
                                       OR dp.price_effective_date !=
                                             sp.price_effective_date)
                                 THEN
                                    2
                                 ELSE
                                    0
                              END
                                 AS scd_row_type_id
                         FROM    stg_product sp
                              LEFT JOIN
                                 Dim_product dp
                              ON (sp.product_id = dp.product_id and  dp.is_current_row = 'Y')
                     ) m
                   JOIN  scd_row_type s
                   ON (s.scd_row_type_id <= m.scd_row_type_id)


Examine the result set, it should look something like:

PRODUCT_SID PRODUCT_ID PRODUCT_CATEGORY_ID PRODUCT_NAME         PRODUCT_DESCRIPTION                                 PRICE   PRICE_EFFECTIVE_DATE LAST_MODIFIED_DATE  CREATED_DATE   SCD_ROW_TYPE_ID
-6789            1            8            Apple iPad 36 GB     Apple iPad 36 GB with the latest createst           700    4/19/2011 2:11        4/19/2011 2:11     4/19/2011 2:10   2
62               1            8            Apple iPad 36 GB     Apple iPad 36 GB with the latest createst           700    4/19/2011 2:11        4/19/2011 2:11     4/19/2011 2:10   2
                 3            6            EPSon Color Printer  EPSon color print with high definition photo prints 200    4/19/2011 2:11        4/19/2011 2:11     4/19/2011 2:11   1

As you can see, we have two rows for product_id 1, and one row for the new record, product_id 3. Also, at the end of the row, you can also see that the scd_record_type_id column has two ids for product id 1. One is 2 and the other one is 1. for the one the that has a scd_record_type_id equql to 1, the product_sid is changed to –6789. Therefore, this record will be used for insert. The logic that define which record to update or insert was defined in the case statement:

(CASE
   WHEN dp.product_id IS NULL
     THEN  1
   WHEN (dp.product_category_id != sp.product_category_id
          OR dp.product_name != sp.product_name
          OR DP.PRODUCT_DESCRIPTION != sp.product_description
          OR dp.price != sp.price
          OR dp.price_effective_date != sp.price_effective_date)
     THEN  2
   ELSE    0
 END) AS scd_row_type_id


If the product_id is not existing in the dim_product table, it is a new record. Therefore, it is assigned a 1 and it will be inserted. If any of the changes occurred to an existing record, it is an update and is assigned a 2. Otherwise, any identical record will be assign a 0. The join statement:

JOIN  scd_row_type s
                   ON (s.scd_row_type_id <= m.scd_row_type_id)

takes only record with a scd_record_type_id that is 1 or greater since the scd_record_type table only have two records, scd_record_type_id 1 and 2. Therefore, any record that with a scd_record_type = 0 (identically records) will not be included in the final result set. To create a type 2 record, which is a existing record that has one or more new attributes, we change the existing product_sid to a dummy number, -6789, by using the decode statement:

DECODE(s.scd_row_type_id,1,-6789,m.product_sid) as product_sid

We choose this number because we had define the surrogate key of the dim_product table to only use positive value for keys. Hence, it will be inserted into the dimension. The "when matched" and "when not matched" clauses in the merge statement are self-explanatory. We matched the record set with the dim_product table on the product_sid key. If it matched, we set the is_current_row flag to ‘N’ and sent the effective_end_date to the last_modified_date. If it was not matched, records will be insert into the dimension table. Now, run the merge statement in section 2 to load the data into the dimension table.

4. Summary

As it is stated in the Oracle documentation, the MERGE statement can be used to select rows from one or more sources for update or insertion into a table or view. You can specify conditions to determine whether to update or insert into the target table or view.

This statement is a convenient way to combine multiple operations. It lets you avoid multiple INSERT, UPDATE, and DELETE DML statements. Therefore, it is helpful in loading data into a slowly changing dimension type 2 table.

MERGE is a deterministic statement. You cannot update the same row of the target table multiple times in the same MERGE statement. Oracle Database does not implement fine-grained access control during MERGE statements.

 If you are using the fine-grained access control feature on the target table or tables, then use equivalent INSERT and UPDATE statements instead of MERGE to avoid error messages and to ensure correct access control.

Reference:


Oracle Online Documentations

Oracle Database 11g The Complete Reference,Kevin Loney, Oracle Press

The Data Warehouse toolkit, Ralph Kimball and Margy Ross

The Data Warehouse ETL Toolkit, Ralph Kimball and Joe Caserta

Other Reference on the Web

Related Note:

SQL Outer Join

13 comments:

Anonymous said...

Thank you! That is great code and exactly what I was looking for!
It stands in the context of http://www.business-intelligence-quotient.com/?p=66 "One pass SCD2 load: How to load a Slowly Changing Dimension Type 2 with one SQL Merge statement in Oracle" by Uli Bethke and http://www.kimballgroup.com/html/08dt/KU107_UsingSQL_MERGESlowlyChangingDimension.pdf "Design Tip #107 Using the SQL MERGE Statement for Slowly Changing Dimension Processing" by Warren Thornthwaite.
It's a pity, that SQL Server has a MERGE-Output clause, that Oracle lacks :-((
But here is your great workaround...
Thanks again
Kai Schubert

Anonymous said...

Hi,

it seems to me, that there is a little mistake in creating the dimension table dim_product:

Product dimension table: dim_product.

CREATE TABLE STG_PRODUCT....

Also the column list of the create statement have not all columns, which are necessary for the insert part of the merge statement.

data bob jr said...

Thank you for pointing that out. I had posted an updated version of the create table statement for the dim_product table.

Anonymous said...

When using the sequence, a PRODUCT_SID value is skipped when performing the Merge.

I worked around this by checking if the DIM_PRODUCT table is empty.

If empty then VALUES (S_DIM_PRODUCT.NEXTVAL

If not empty then VALUES (S_DIM_PRODUCT.NEXTVAL-1

Anonymous said...

Thank you. Its really nice code with great explanation. I am creating cursor with FULL Outer join between stage and dimension tables and then processing each rec. But wanted to give a try with merge statement.

Anonymous said...

What if more that one new records (for the same product id) come from the stage table?
Seems like from this code they both will have end date as 12/31/9999

Unknown said...

Smart solution! I believe with a little tweaking it can even work for hybrid T1/T2 dimensions.

Vladimir Radojičić said...

Great code, thank you!
It is true that Oracle lacks MERGE-Output clause, that SQL Server has, but on the other hand, I like doing more with less :). If you don't need something, don't use it :).

Anonymous said...

thanks much!! awesome solution!!

Anonymous said...

awesome solution!! really improves the performance..

Anonymous said...

"As you can see, we have two rows for product_id 1, and one row for the new record, product_id 3. Also, at the end of the row, you can also see that the scd_record_type_id column has two ids for product id 1. One is 2 and the other one is 1."

Sorry, that's not what the listing above this sentence depicts:

1. The last column is calles SCD_ROW_TYPE_ID, not SCD_RECORD_TYPE_ID.
2. For both records with product_id 1, this SCD_ROW_TYPE_ID has a value of 2, although - according to your explanation - one of the records is to be inserted and one is to be updated.
3. The effective date of the record to-be-inserted should be after the effective date of its temporal predecessor.
4. Referring to the same listing; none of the attributes of Product ID 1 has changed in the Stage Table. So it is not necessary at all to create a new record in the Dimension Table.

Anonymous said...

What if there is no date or flag column in the source? Can we still achieve SCD using Oracle MERGE statement? or is there any alternate way to achieve it?

Anonymous said...

Can we implement this when we have multiple changes ? like for 1 iteration we have 2 different values for a single product. can you modify the solution in that scenario?
Thanks in advance.