Monday, November 21, 2011

LOAD SLOWLY CHANGING DIMENSION TYPE 2 USING MYSQL STORED PROCEDURE

LOAD SLOWLY CHANGING DIMENSION TYPE 2 USING MYSQL STORED PROCEDURE

Unlike Oracle (merge statement) or other databases, MySQL does not support “upsert” (perform update or insert in one sql statement). Therefore, it requires some coding to achieve both insert and update when loading a type 2 slowly changing dimension.

Here was my lab exercise  which I used a common dimension, product.

1. Tables.

I created the following tables during my lab exercise.

1.1 Dimension table: dim_product.

This was the product dimension table that we want to keep history of changes of production name, product description, product category and price.
Create table dim_product(
PRODUCT_SID int(12) NOT NULL auto_increment ,
PRODUCT_ID int(12) NOT NULL,
PRODUCT_CATEGORY_ID int(10) not null,
PRODUCT_NAME VARCHAR(50),
PRODUCT_DESCRIPTION VARCHAR(200),
PRICE DECIMAL(10,2),
PRICE_EFFECTIVE_DATE DATEtime,
LAST_MODIFIED_DATE DATEtime,
EFFECTIVE_START_DATE DATEtime,
EFFECTIVE_END_DATE DATEtime,
IS_CURRENT_ROW char(1),
CREATED_DATE DATEtime,
UPDATED_DATE DATEtime,
primary key (product_sid)
);

1.2 Staging table: stg_product.

This was the staging table that we will use to store new product records. Assuming production records will transfer to this table first before load into the product dimension.

create table pcdb.stg_product (
product_id int(12) not null,
product_category_id int(10) not null,
product_name varchar(50),
product_description varchar(200),
price decimal(8,2),
price_effective_date datetime,
last_modified_date datetime,
created_date datetime);

1.3 Tempary table: tmp_product.

This is a temporary table that will be used in the store procedure to process data.

create table tmp_product (
PRODUCT_SID int(12) ,
PRODUCT_ID int(12) NOT NULL,
PRODUCT_CATEGORY_ID int(10) not null,
PRODUCT_NAME VARCHAR(50),
PRODUCT_DESCRIPTION VARCHAR(200),
PRICE DECIMAL(10,2),
PRICE_EFFECTIVE_DATE DATEtime,
LAST_MODIFIED_DATE DATEtime,
created_date datetime,
scd_row_type_id tinyint(2)
);

2. The Stored Procedure

I created a stored procedure to accomplish the SCD type 2 dimension load in MySQL. Here was the simple store procedure.

use labdb
drop procedure if exists sp_dim_product;
delimiter //

CREATE PROCEDURE sp_dim_product()

BEGIN

/* Procedure name: sp_dim_prodcut
Description: user subscriptions 
SCD: Type 2.
Dependence: dim_user_product
Table size: large. keep track changes in product.
*/

Declare v_start_date datetime;
declare v_load_id int;
declare v_error_msg varchar(200);
declare v_proc_name varchar(36) default 'sp_dim_product'; 
declare v_validation tinyint;
declare v_max_data_date datetime;

DECLARE EXIT HANDLER FOR SQLEXCEPTION 
BEGIN 
set v_error_msg =concat('Procedure ',v_proc_name,' Encounter SQLEXCEPTION Error: Error code unknown');
SELECT v_error_msg; 
END;

truncate table tmp_product;

insert into tmp_product(
PRODUCT_SID ,
PRODUCT_ID ,
PRODUCT_CATEGORY_ID ,
PRODUCT_NAME ,
PRODUCT_DESCRIPTION ,
PRICE ,
PRICE_EFFECTIVE_DATE ,
LAST_MODIFIED_DATE ,
EFFECTIVE_START_DATE ,
EFFECTIVE_END_DATE ,
IS_CURRENT_ROW ,
scd_row_type_id ,
)
select 
ifnull(d.product_sid,0),
p.product_id,
p.product_category_id,
p.product_name,
p.product_description,
p.price,
p.price_effective_date,
p.last_modified_date,
p.last_modified_date as effective_start_date,
'2050/12/31' as effective_end_date,
'T',
( case when d.product_sid is null then 1 
when (d.product_sid is not null and
(p.product_category_id <> d.product_category_id or
p.product_name <> d.product_name or
p.product_description <> d.PRODUCT_DESCRIPTION or
p.price <> d.price) 
) then 2
else 0 
end) as scd_row_type_id ,
now()
from stg_product p
left join dim_product d 
on (p.product_id = d.product_id and d.is_current_row = 'T')
;

update dim_product d, tmp_product t
set d.is_current_row = 'F', 
d.EFFECTIVE_END_DATE = t.LAST_MODIFIED_DATE,
updated_date = now()
where d.product_sid = t.product_sid
and t.scd_row_type_id = 2;

insert into dim_product(
PRODUCT_ID ,
PRODUCT_CATEGORY_ID ,
PRODUCT_NAME ,
PRODUCT_DESCRIPTION ,
PRICE ,
PRICE_EFFECTIVE_DATE ,
LAST_MODIFIED_DATE ,
EFFECTIVE_START_DATE ,
EFFECTIVE_END_DATE ,
IS_CURRENT_ROW ,
CREATED_DATE ,
UPDATED_DATE 
)
select PRODUCT_ID ,
PRODUCT_CATEGORY_ID ,
PRODUCT_NAME ,
PRODUCT_DESCRIPTION ,
PRICE ,
PRICE_EFFECTIVE_DATE ,
LAST_MODIFIED_DATE ,
LAST_MODIFIED_DATE ,
EFFECTIVE_END_DATE ,
IS_CURRENT_ROW ,
now() ,
now()
from tmp_product
where scd_row_type_id in (1,2)
;

commit;

END;
//

3 How it works

A slowly changing dimension type 2 (SCD2) is defined to keep track changes of attributes in the underlying source table. While there are many attributes in a table, the business rules decide what changes to track. Here, for demonstration purpose, I choose to keep track of changes of each of the following attributes: production name, product description, and product category and price.

In the procedure, I used left outer join on product id, between the dimension table (existing records) and the staging table (new and modified records). For the new records, the product_sid will be null because they were not in the dimention table and I used a scd_row_type_id=1 flag to indicate that they were new records and assigned each of them with a ‘0’ poduct_sid. For the existing records, there would be a product_sid for each record and I used a scd_row_type_id=2 flag to indicate that they were modified exiting records.

select 
ifnull(d.product_sid,0),
p.product_id,
p.product_category_id,
p.product_name,
p.product_description,
p.price,
p.price_effective_date,
p.last_modified_date,
p.last_modified_date as effective_start_date,
'2050/12/31' as effective_end_date,
'T',
( case when d.product_sid is null then 1 
when (d.product_sid is not null and
(p.product_category_id <> d.product_category_id or
p.product_name <> d.product_name or
p.product_description <> d.PRODUCT_DESCRIPTION or
p.price <> d.price) 
) then 2
else 0 
end) as scd_row_type_id ,
now()
from stg_product p
left join dim_product d 
on (p.product_id = d.product_id and d.is_current_row = 'T')
;

update dim_product d, tmp_product t
set d.is_current_row = 'F', 
d.EFFECTIVE_END_DATE = t.LAST_MODIFIED_DATE,
updated_date = now()
where d.product_sid = t.product_sid
and t.scd_row_type_id = 2;

To identify the modified records, I used a case statement. By following the business rules, I compared each of the attributes to see if any of the attributes were modified. If it was, assigned a ‘2’ to the scd_row_type_id flag.

( case when d.product_sid is null then 1 
when (d.product_sid is not null and
(p.product_category_id <> d.product_category_id or
p.product_name <> d.product_name or
p.product_description <> d.PRODUCT_DESCRIPTION or
p.price <> d.price) )
 then 2

To indicate the current record, I created a column called IS_CURRENT_ROW and set it to ‘T’ it was a current row. For existing (history) records, I set it to the IS_CURRENT_ROW to ‘F’.

update dim_product d, tmp_product t
set d.is_current_row = 'F', 
d.EFFECTIVE_END_DATE = t.LAST_MODIFIED_DATE,
updated_date = now()
where d.product_sid = t.product_sid
and t.scd_row_type_id = 2;


At the end of the procedure, I inserted both the new and modified records which the t.scd_row_type_id was ‘1’ or ‘2’. Please note that records had not been modified, would had scd_row_type_id = 0 and would not be inserted into the dimention table.

insert into dim_product(
PRODUCT_ID ,
PRODUCT_CATEGORY_ID ,
PRODUCT_NAME ,
PRODUCT_DESCRIPTION ,
PRICE ,
PRICE_EFFECTIVE_DATE ,
LAST_MODIFIED_DATE ,
EFFECTIVE_START_DATE ,
EFFECTIVE_END_DATE ,
IS_CURRENT_ROW ,
CREATED_DATE ,
UPDATED_DATE 
)
select PRODUCT_ID ,
PRODUCT_CATEGORY_ID ,
PRODUCT_NAME ,
PRODUCT_DESCRIPTION ,
PRICE ,
PRICE_EFFECTIVE_DATE ,
LAST_MODIFIED_DATE ,
LAST_MODIFIED_DATE ,
EFFECTIVE_END_DATE ,
IS_CURRENT_ROW ,
now() ,
now()
from tmp_product
where scd_row_type_id in (1,2)
;

To test the procedure, first run the following query to insert two records into the staging table:

truncate table stg_product;

insert into stg_product ( PRODUCT_ID, PRODUCT_CATEGORY_ID, PRODUCT_NAME,PRODUCT_DESCRIPTION,PRICE,PRICE_EFFECTIVE_DATE, LAST_MODIFIED_DATE, CREATED_DATE)
values (1,8,'Apple iPad 36 GB','Apple iPad 36 GB with case', 800,now(),now(),now());

insert into stg_product ( PRODUCT_ID, PRODUCT_CATEGORY_ID, PRODUCT_NAME,PRODUCT_DESCRIPTION,PRICE,PRICE_EFFECTIVE_DATE, LAST_MODIFIED_DATE,CREATED_DATE)
values (2,7,'Canon 12 MPIX Digital Camera','Canon 12 MPIX Digital Camera, Zoon Len, Case', 150,now(),now(),now());

commit;


Check the stage tables to make sure the records are there.

MySQL>select * from stg_product;

Now run the store procedure which was created in section 2. After running the procedure, run the following query to check the records in the dim_product table.

MySQL>select * from dim_product;

There should be two rows in the dimension table with the column effective_start_date, effective_end_date and is_current_row populated. Now, let’s find out how the merge statement handles SCD Type 2 records. To begin, run the following queries against the stage table:

update stg_product set price = 700 , price_effective_date = now(), last_modified_date = now() where product_id = 1;
insert into stg_product (
PRODUCT_ID, 
PRODUCT_CATEGORY_ID,
PRODUCT_NAME,
PRODUCT_DESCRIPTION,
PRICE,
PRICE_EFFECTIVE_DATE,
LAST_MODIFIED_DATE,
CREATED_DATE)
values (3,6,'EPSon Color Printer',
'EPSon color print with high definition photo prints', 200,now(),now(),now());
commit;


The above statements updated a record where product_id equal to 1 to change it price to 700, and inserted one more new record, product_id 3, into the stg_product table. To help understand what the records set looks like from the select statement that use the left outer join, let’s copy the query and execute it to get the record set.

select 
ifnull(d.product_sid,0),
p.product_id,
p.product_category_id,
p.product_name,
p.product_description,
p.price,
p.price_effective_date,
p.last_modified_date,
p.last_modified_date as effective_start_date,
'2050/12/31' as effective_end_date,
'T',
( case when d.product_sid is null then 1 
when (d.product_sid is not null and
(p.product_category_id <> d.product_category_id or
p.product_name <> d.product_name or
p.product_description <> d.PRODUCT_DESCRIPTION or
.price <> d.price) 
) then 2
else 0 
end) as scd_row_type_id ,
now()
from stg_product p
left join dim_product d 
on (p.product_id = d.product_id and d.is_current_row = 'T')
;


As you can see, we have one rows for product_id 1, and one row for the new record, product_id 3. Also, at the end of the row, you can also see that the scd_record_type_id column has two ids for product id 1. One is 2 and the other one is 1.

Now, run the stored procedure in section 2 to load the data into the dimension table.

4. Summary

This purpose of this procedure is to show how to use outer join to maintain a Type 2 slow changing dimention in MySQL. This procedure dose not include all the necessary error checking routines. Also, there are other ‘real world’ situations need to be considered when implementing this procedure. For example, if there are large amount of records have been modified for any given time, there may be a performance issue when performing an update. Also, as the dimension table growth bigger, outer join will get slower. Therefore, other techniques need to be used to make the codes more efficient.


Reference:

Related Note:

1 comment:

Anonymous said...

This is very interesting, You're a very skilled blogger.
I've joined your rss feed and look forward to seeking more of your fantastic post.

Also, I have shared your web site in my social networks!


My web page: check it out