
Friday, June 15, 2012

Create Date Dimension Table in MySQL

Date Dimension Table

A date dimension is used to store the attributes of a calendar date. Assigning a day foreign key to a fact table makes it more efficient to generate date-related query results. There are many ways to define the table, and more attributes can be added based on the data warehouse model and requirements. Here is a sample table created for a MySQL database.

create table dim_day (
 day_key int not null auto_increment,  -- primary key
 date datetime,      -- date: 2008-08-18 00:00:00
 date_num int(8),    -- numeric value in YYYYMMDD, e.g. 20080818
 day_num int(2),     -- day of the month, e.g. 18
 day_of_year int(4), -- the day of the year
 day_of_week int(2), -- the day of the week
 day_of_week_name varchar(20), -- day of week name (Monday, Tuesday, etc.)
 week_num int(2),    -- week of the year
 week_begin_date datetime,  -- week begin date
 week_end_date datetime,    -- week end date
 last_week_begin_date datetime,   -- prior week begin date
 last_week_end_date datetime,     -- prior week end date
 last_2_week_begin_date datetime, -- begin date of the week two weeks ago
 last_2_week_end_date datetime,   -- end date of the week two weeks ago
 month_num int(2),        -- month in number, e.g. 12
 month_name varchar(20),  -- month in name, e.g. December
 yearmonth_num int(6),    -- year and month in number, e.g. 201212
 last_month_num int(2),   -- prior month in number, e.g. 11
 last_month_name varchar(20), -- prior month in name, e.g. November
 last_month_year int(4),  -- year of the prior month, e.g. 2012
 last_yearmonth_num int(6), -- prior year and month in number, e.g. 201211
 quarter_num int(2),  -- quarter in number, e.g. 4
 year_num int(4),     -- year in number, e.g. 2012
 created_date timestamp not null,  -- date record was created
 updated_date timestamp not null,  -- date record was updated
 primary key (day_key)
 );

Here is a procedure used to populate the table in MySQL. It takes two input parameters, p_start_date and p_end_date, for example, 1990-01-01 and 2099-12-31.

use dwdb;

drop procedure if exists sp_day_dim;

truncate table dim_day;

delimiter //

CREATE PROCEDURE sp_day_dim (in p_start_date datetime, in p_end_date datetime)
BEGIN

  Declare StartDate datetime;
  Declare EndDate datetime;
  Declare RunDate datetime;

-- Set date variables

  Set StartDate = p_start_date; -- update this value to reflect the earliest date that you will use.
  Set EndDate = p_end_date; -- update this value to reflect the latest date that you will use.
  Set RunDate = StartDate;

-- Loop through each date and insert into the dim_day table

WHILE RunDate <= EndDate DO

INSERT Into dim_day(
 date ,
 date_num,
 day_num ,
 Day_of_Year,
 Day_of_Week,
 Day_of_week_name,
 Week_num,
 week_begin_date,
 week_end_date,
 last_week_begin_date,
 last_week_end_date,
 last_2_week_begin_date,
 last_2_week_end_date,
 Month_num ,
 Month_Name,
 yearmonth_num,
 last_month_num,
 last_month_name,
 last_month_year,
 last_yearmonth_num,
 Quarter_num ,
 Year_num  ,
 created_date, 
 updated_date 
)
select 
RunDate date
,CONCAT(year(RunDate), lpad(MONTH(RunDate),2,'0'),lpad(day(RunDate),2,'0')) date_num
,day(RunDate) day_num
,DAYOFYEAR(RunDate) day_of_year
,DAYOFWEEK(RunDate) day_of_week
,DAYNAME(RunDate) day_of_week_name
,WEEK(RunDate) week_num
,DATE_ADD(RunDate, INTERVAL(1-DAYOFWEEK(RunDate)) DAY) week_begin_date
,ADDTIME(DATE_ADD(RunDate, INTERVAL(7-DAYOFWEEK(RunDate)) DAY),'23:59:59') week_end_date
,DATE_ADD(RunDate, INTERVAL ((1-DAYOFWEEK(RunDate))-7) DAY) last_week_begin_date
,ADDTIME(DATE_ADD(RunDate, INTERVAL ((7-DAYOFWEEK(RunDate))-7) DAY),'23:59:59') last_week_end_date
,DATE_ADD(RunDate, INTERVAL ((1-DAYOFWEEK(RunDate))-14) DAY) last_2_week_begin_date
,ADDTIME(DATE_ADD(RunDate, INTERVAL ((7-DAYOFWEEK(RunDate))-14) DAY),'23:59:59') last_2_week_end_date
,MONTH(RunDate) month_num
,MONTHNAME(RunDate) month_name
,CONCAT(year(RunDate), lpad(MONTH(RunDate),2,'0')) YEARMONTH_NUM
,MONTH(date_add(RunDate,interval -1 month)) last_month_num
,MONTHNAME(date_add(RunDate,interval -1 month)) last_month_name
,year(date_add(RunDate,interval -1 month)) last_month_year
,CONCAT(year(date_add(RunDate,interval -1 month)),lpad(MONTH(date_add(RunDate,interval -1 month)),2,'0')) Last_YEARMONTH_NUM
,QUARTER(RunDate) quarter_num
,YEAR(RunDate) year_num
,now() created_date
,now() updated_date
;
-- commit;
-- increase the value of the RunDate variable by 1 day

Set RunDate = ADDDATE(RunDate,1);

END WHILE;
commit;
END;
//
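
To populate the table, reset the delimiter and call the procedure with the desired date range, then spot-check a few rows. A quick usage sketch, using the example range from above:

delimiter ;

call sp_day_dim('1990-01-01', '2099-12-31');

-- spot-check a few rows
select date_num, day_of_week_name, week_begin_date, week_end_date
from dim_day
order by day_key
limit 5;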


Monday, November 21, 2011

LOAD SLOWLY CHANGING DIMENSION TYPE 2 USING MYSQL STORED PROCEDURE


Unlike Oracle (with its MERGE statement) and some other databases, MySQL does not support a full "upsert" (performing an update or insert in one SQL statement). Therefore, it takes some coding to achieve both insert and update when loading a type 2 slowly changing dimension.

Here is my lab exercise, in which I used a common dimension: product.

1. Tables.

I created the following tables during my lab exercise.

1.1 Dimension table: dim_product.

This is the product dimension table in which we want to keep a history of changes to product name, product description, product category, and price.
Create table dim_product(
PRODUCT_SID int(12) NOT NULL auto_increment ,
PRODUCT_ID int(12) NOT NULL,
PRODUCT_CATEGORY_ID int(10) not null,
PRODUCT_NAME VARCHAR(50),
PRODUCT_DESCRIPTION VARCHAR(200),
PRICE DECIMAL(10,2),
PRICE_EFFECTIVE_DATE DATETIME,
LAST_MODIFIED_DATE DATETIME,
EFFECTIVE_START_DATE DATETIME,
EFFECTIVE_END_DATE DATETIME,
IS_CURRENT_ROW char(1),
CREATED_DATE DATETIME,
UPDATED_DATE DATETIME,
primary key (product_sid)
);

1.2 Staging table: stg_product.

This is the staging table used to hold incoming product records. The assumption is that product records land in this table first before being loaded into the product dimension.

create table stg_product (
product_id int(12) not null,
product_category_id int(10) not null,
product_name varchar(50),
product_description varchar(200),
price decimal(8,2),
price_effective_date datetime,
last_modified_date datetime,
created_date datetime);

1.3 Temporary table: tmp_product.

This is a temporary table that will be used in the stored procedure to process data.

create table tmp_product (
PRODUCT_SID int(12),
PRODUCT_ID int(12) NOT NULL,
PRODUCT_CATEGORY_ID int(10) not null,
PRODUCT_NAME VARCHAR(50),
PRODUCT_DESCRIPTION VARCHAR(200),
PRICE DECIMAL(10,2),
PRICE_EFFECTIVE_DATE DATETIME,
LAST_MODIFIED_DATE DATETIME,
EFFECTIVE_START_DATE DATETIME,
EFFECTIVE_END_DATE DATETIME,
IS_CURRENT_ROW char(1),
created_date DATETIME,
scd_row_type_id tinyint(2)
);

2. The Stored Procedure

I created a stored procedure to accomplish the SCD type 2 dimension load in MySQL. Here is the stored procedure.

use labdb;
drop procedure if exists sp_dim_product;
delimiter //

CREATE PROCEDURE sp_dim_product()

BEGIN

/* Procedure name: sp_dim_product
Description: loads the product dimension
SCD: Type 2.
Dependencies: stg_product, tmp_product, dim_product
Table size: large; keeps track of changes in product.
*/

Declare v_start_date datetime;
declare v_load_id int;
declare v_error_msg varchar(200);
declare v_proc_name varchar(36) default 'sp_dim_product'; 
declare v_validation tinyint;
declare v_max_data_date datetime;

DECLARE EXIT HANDLER FOR SQLEXCEPTION 
BEGIN 
set v_error_msg =concat('Procedure ',v_proc_name,' Encounter SQLEXCEPTION Error: Error code unknown');
SELECT v_error_msg; 
END;

truncate table tmp_product;

insert into tmp_product(
PRODUCT_SID ,
PRODUCT_ID ,
PRODUCT_CATEGORY_ID ,
PRODUCT_NAME ,
PRODUCT_DESCRIPTION ,
PRICE ,
PRICE_EFFECTIVE_DATE ,
LAST_MODIFIED_DATE ,
EFFECTIVE_START_DATE ,
EFFECTIVE_END_DATE ,
IS_CURRENT_ROW ,
scd_row_type_id ,
created_date
)
select 
ifnull(d.product_sid,0),
p.product_id,
p.product_category_id,
p.product_name,
p.product_description,
p.price,
p.price_effective_date,
p.last_modified_date,
p.last_modified_date as effective_start_date,
'2050/12/31' as effective_end_date,
'T',
( case when d.product_sid is null then 1 
when (d.product_sid is not null and
(p.product_category_id <> d.product_category_id or
p.product_name <> d.product_name or
p.product_description <> d.PRODUCT_DESCRIPTION or
p.price <> d.price) 
) then 2
else 0 
end) as scd_row_type_id ,
now()
from stg_product p
left join dim_product d 
on (p.product_id = d.product_id and d.is_current_row = 'T')
;

update dim_product d, tmp_product t
set d.is_current_row = 'F', 
d.EFFECTIVE_END_DATE = t.LAST_MODIFIED_DATE,
d.updated_date = now()
where d.product_sid = t.product_sid
and t.scd_row_type_id = 2;

insert into dim_product(
PRODUCT_ID ,
PRODUCT_CATEGORY_ID ,
PRODUCT_NAME ,
PRODUCT_DESCRIPTION ,
PRICE ,
PRICE_EFFECTIVE_DATE ,
LAST_MODIFIED_DATE ,
EFFECTIVE_START_DATE ,
EFFECTIVE_END_DATE ,
IS_CURRENT_ROW ,
CREATED_DATE ,
UPDATED_DATE 
)
select PRODUCT_ID ,
PRODUCT_CATEGORY_ID ,
PRODUCT_NAME ,
PRODUCT_DESCRIPTION ,
PRICE ,
PRICE_EFFECTIVE_DATE ,
LAST_MODIFIED_DATE ,
LAST_MODIFIED_DATE ,
EFFECTIVE_END_DATE ,
IS_CURRENT_ROW ,
now() ,
now()
from tmp_product
where scd_row_type_id in (1,2)
;

commit;

END;
//
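
Once the procedure is created, reset the delimiter and run the load; it takes no parameters:

delimiter ;

call sp_dim_product();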

3. How It Works

A slowly changing dimension type 2 (SCD2) is defined to keep track of changes to attributes in the underlying source table. While there are many attributes in a table, the business rules decide which changes to track. Here, for demonstration purposes, I chose to track changes to each of the following attributes: product name, product description, product category, and price.

In the procedure, I used a left outer join on product id between the dimension table (existing records) and the staging table (new and modified records). For new records, the product_sid is null because they are not yet in the dimension table; I used the flag scd_row_type_id=1 to indicate that they are new records and assigned each of them a product_sid of 0. For existing records, there is a product_sid for each record, and I used the flag scd_row_type_id=2 to indicate that they are modified existing records.

select 
ifnull(d.product_sid,0),
p.product_id,
p.product_category_id,
p.product_name,
p.product_description,
p.price,
p.price_effective_date,
p.last_modified_date,
p.last_modified_date as effective_start_date,
'2050/12/31' as effective_end_date,
'T',
( case when d.product_sid is null then 1 
when (d.product_sid is not null and
(p.product_category_id <> d.product_category_id or
p.product_name <> d.product_name or
p.product_description <> d.PRODUCT_DESCRIPTION or
p.price <> d.price) 
) then 2
else 0 
end) as scd_row_type_id ,
now()
from stg_product p
left join dim_product d 
on (p.product_id = d.product_id and d.is_current_row = 'T')
;

To identify the modified records, I used a case statement. Following the business rules, I compared each of the attributes to see whether any had been modified. If so, the scd_row_type_id flag was set to 2.

( case when d.product_sid is null then 1 
when (d.product_sid is not null and
(p.product_category_id <> d.product_category_id or
p.product_name <> d.product_name or
p.product_description <> d.PRODUCT_DESCRIPTION or
p.price <> d.price) )
 then 2

To indicate the current record, I created a column called IS_CURRENT_ROW and set it to 'T' if the row is the current row. For existing (history) records, I set IS_CURRENT_ROW to 'F'.

update dim_product d, tmp_product t
set d.is_current_row = 'F', 
d.EFFECTIVE_END_DATE = t.LAST_MODIFIED_DATE,
d.updated_date = now()
where d.product_sid = t.product_sid
and t.scd_row_type_id = 2;


At the end of the procedure, I inserted both the new and the modified records, i.e., those for which scd_row_type_id was 1 or 2. Please note that records that had not been modified would have scd_row_type_id = 0 and would not be inserted into the dimension table.

insert into dim_product(
PRODUCT_ID ,
PRODUCT_CATEGORY_ID ,
PRODUCT_NAME ,
PRODUCT_DESCRIPTION ,
PRICE ,
PRICE_EFFECTIVE_DATE ,
LAST_MODIFIED_DATE ,
EFFECTIVE_START_DATE ,
EFFECTIVE_END_DATE ,
IS_CURRENT_ROW ,
CREATED_DATE ,
UPDATED_DATE 
)
select PRODUCT_ID ,
PRODUCT_CATEGORY_ID ,
PRODUCT_NAME ,
PRODUCT_DESCRIPTION ,
PRICE ,
PRICE_EFFECTIVE_DATE ,
LAST_MODIFIED_DATE ,
LAST_MODIFIED_DATE ,
EFFECTIVE_END_DATE ,
IS_CURRENT_ROW ,
now() ,
now()
from tmp_product
where scd_row_type_id in (1,2)
;

To test the procedure, first run the following statements to insert two records into the staging table:

truncate table stg_product;

insert into stg_product ( PRODUCT_ID, PRODUCT_CATEGORY_ID, PRODUCT_NAME,PRODUCT_DESCRIPTION,PRICE,PRICE_EFFECTIVE_DATE, LAST_MODIFIED_DATE, CREATED_DATE)
values (1,8,'Apple iPad 36 GB','Apple iPad 36 GB with case', 800,now(),now(),now());

insert into stg_product ( PRODUCT_ID, PRODUCT_CATEGORY_ID, PRODUCT_NAME,PRODUCT_DESCRIPTION,PRICE,PRICE_EFFECTIVE_DATE, LAST_MODIFIED_DATE,CREATED_DATE)
values (2,7,'Canon 12 MPIX Digital Camera','Canon 12 MPIX Digital Camera, Zoon Len, Case', 150,now(),now(),now());

commit;


Check the staging table to make sure the records are there.

MySQL>select * from stg_product;

Now run the stored procedure that was created in section 2. After running the procedure, run the following query to check the records in the dim_product table.

MySQL>select * from dim_product;

There should be two rows in the dimension table, with the columns effective_start_date, effective_end_date, and is_current_row populated. Now, let's find out how the procedure handles SCD Type 2 changes. To begin, run the following queries against the staging table:

update stg_product set price = 700 , price_effective_date = now(), last_modified_date = now() where product_id = 1;
insert into stg_product (
PRODUCT_ID, 
PRODUCT_CATEGORY_ID,
PRODUCT_NAME,
PRODUCT_DESCRIPTION,
PRICE,
PRICE_EFFECTIVE_DATE,
LAST_MODIFIED_DATE,
CREATED_DATE)
values (3,6,'EPSon Color Printer',
'EPSon color print with high definition photo prints', 200,now(),now(),now());
commit;


The statements above updated the record where product_id equals 1 to change its price to 700, and inserted one more new record, product_id 3, into the stg_product table. To understand what the record set from the left-outer-join select looks like, copy the query and execute it:

select 
ifnull(d.product_sid,0),
p.product_id,
p.product_category_id,
p.product_name,
p.product_description,
p.price,
p.price_effective_date,
p.last_modified_date,
p.last_modified_date as effective_start_date,
'2050/12/31' as effective_end_date,
'T',
( case when d.product_sid is null then 1 
when (d.product_sid is not null and
(p.product_category_id <> d.product_category_id or
p.product_name <> d.product_name or
p.product_description <> d.PRODUCT_DESCRIPTION or
p.price <> d.price) 
) then 2
else 0 
end) as scd_row_type_id ,
now()
from stg_product p
left join dim_product d 
on (p.product_id = d.product_id and d.is_current_row = 'T')
;


As you can see, there is one row for product_id 1 with scd_row_type_id = 2 (the modified record), one row for product_id 2 with scd_row_type_id = 0 (unchanged), and one row for the new record, product_id 3, with scd_row_type_id = 1.

Now, run the stored procedure in section 2 to load the data into the dimension table.

4. Summary

The purpose of this procedure is to show how to use an outer join to maintain a type 2 slowly changing dimension in MySQL. The procedure does not include all the necessary error-checking routines. There are also other 'real world' situations to consider when implementing it. For example, if a large number of records have been modified at any given time, the update may become a performance issue. Also, as the dimension table grows bigger, the outer join will get slower. Therefore, other techniques may be needed to make the code more efficient.
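
As one small step in that direction, here is a sketch (the index names are my own choice, not part of the original lab): since the outer join and the history update both filter on product_id and is_current_row, indexes on those columns help the lookups as the tables grow.

-- helps the left join on (product_id, is_current_row = 'T')
create index idx_dim_product_current on dim_product (product_id, is_current_row);

-- helps the update join from tmp_product back to dim_product
create index idx_tmp_product_sid on tmp_product (product_sid);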



Friday, July 22, 2011

Passing Parameters To A MYSQL Query Inside A Shell Script


From time to time, I find myself needing to run ad hoc queries to pull data. I found it more convenient to create a shell script that accepts a query file as an argument and can pass variables to the query.

/* script:get_customer_record.sql */

Select c_id, c_first_name,c_last_name, c_address,
……
,last_modified_date
from customer
where last_modified_date >=@start_date and last_modified_date <= @end_date;

@start_date and @end_date are variables to be passed in at run time.

Here is the wrapper shell script:

#!/bin/bash

## script name; mysql_script_runner.sh
## wrapper script to execute mysql script with variables

ARGS=4

if [ $# -ne "$ARGS" ]
then 

 echo "you passed $# parameters"
 echo "Usage: `basename $0` sql_script_file start_date end_date output_file"

 exit 1
fi 
sql_script=$1 
start_date=$2 
end_date=$3 
output_file=$4 

#run mysql query with parameters (the date values are quoted so MySQL treats them as strings)

/usr/bin/mysql -u user_id -ppassword -h mysql-host -A -e "set @start_date='${start_date}'; set @end_date='${end_date}'; source ${sql_script};" > ${output_file}

exit

# end of script.


To execute the script from command line, I can do something like this:

 # mysql_script_runner.sh get_customer_record.sql '2011-06-01' '2011-06-12 23:59:59' cust_rec.dat

I can also set it up to run as a cron job with some small changes.
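
For example, a crontab entry along these lines would run the pull nightly. This is only a sketch; in practice the date arguments would be computed rather than hard-coded, and the paths here are placeholders:

# run daily at 1:30 AM; note that % must be escaped as \% inside a crontab
30 1 * * * /path/to/mysql_script_runner.sh /path/to/get_customer_record.sql "$(date -d yesterday +\%Y-\%m-\%d)" "$(date +\%Y-\%m-\%d)" /path/to/cust_rec.dat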


Tuesday, October 26, 2010

MYSQL: SPEED UP INSERTS INTO A MYISAM TABLE

According to the MySQL Manual, there are a few ways to speed up inserts when adding data to a MyISAM table. Below is the original documentation from the MySQL manual plus the notes I added while experimenting with real data loads.

1. Background Information
 
The time required for inserting a row, according to the MySQL manual, is determined by the following factors, where the numbers indicate approximate proportions:
· Connecting: (3)
· Sending query to server: (2)
· Parsing query: (2)
· Inserting row: (1 × size of row)
· Inserting indexes: (1 × number of indexes)
· Closing: (1)
(There are many server variables that can be used to help optimize user connections.)

There are two additional things to consider regarding overhead for inserts:
  • This does not take into consideration the initial overhead to open tables, which is done once for each concurrently running query. To reduce the number of table opens, let the server cache open tables by setting the table_cache variable. Please note that the table_cache, max_connections, and max_tmp_tables system variables affect the maximum number of files the server keeps open. If table_cache is set too high, MySQL may run out of file descriptors and refuse connections, fail to perform queries, and be very unreliable. You also have to take into account that the MyISAM storage engine needs two file descriptors for each unique open table. Please refer to the related note for connection-related variable settings.
  • The size of the table slows down the insertion of indexes by log N, assuming B-tree indexes.

2. Methods Which Can Be Used to Speed Up Inserts 
These are the methods that can be used to speed up inserts. You will notice that most of the methods mentioned here relate to MySQL MyISAM tables.

(1.) If you are inserting many rows from the same client at the same time, use INSERT statements with multiple VALUES lists to insert several rows at a time. This is considerably faster (many times faster in some cases) than using separate single-row INSERT statements.
   Insert into t1 (a,b,c,d) values (1,2,3,'2010-01-01'), (3,5,6,'2010-02-01'), (7,8,9,'2010-03-01');

Instead of:

   Insert into t1 (a,b,c,d) values (1,2,3,'2010-01-01');
   Insert into t1 (a,b,c,d) values (3,5,6,'2010-02-01');
   Insert into t1 (a,b,c,d) values (7,8,9,'2010-03-01');

(2.) If you are adding data to a nonempty table, you can tune the bulk_insert_buffer_size variable to make data insertion even faster.
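
For example, a sketch; the 64 MB figure is an arbitrary illustration that should be sized to your workload:

-- check the current value (the default is 8 MB)
SHOW VARIABLES LIKE 'bulk_insert_buffer_size';

-- raise it for the current session before a large batch insert
SET SESSION bulk_insert_buffer_size = 64 * 1024 * 1024;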

(3.) If multiple clients are inserting a lot of rows, you can get higher speed by using the INSERT DELAYED statement. The DELAYED modifier delays the inserts by storing the insert statements in a memory queue until the target table is free from reads. Hence, the insert statement returns quickly, and all the insert statements stored in the queue can be written in one block, which is more efficient than writing one statement at a time.



Here is the syntax for INSERT DELAYED:

Insert delayed into t1 (a,b,c,d) values (1,2,3,'2010-01-01'), (3,5,6,'2010-02-01'), (7,8,9,'2010-03-01');
See Section 12.2.5.2, INSERT
Please note that INSERT DELAYED only works with MyISAM, MEMORY, ARCHIVE, and BLACKHOLE (as of v5.1.19) tables. And there is a downside to using INSERT DELAYED: there is a performance penalty if the table is empty or not otherwise in use, and it adds overhead for the server to hold an additional handler thread for each table that has delayed rows. Please read the manual for details on the negative impacts.
These system status variables can be used to monitor INSERT DELAYED use. Here is what each status variable means:
  • Delayed_insert_threads -- Number of handler threads
  • Delayed_writes -- Number of rows written with INSERT DELAYED
  • Not_flushed_delayed_rows -- Number of rows waiting to be written

 (4.) For a MyISAM table, you can use concurrent inserts to add rows at the same time that SELECT statements are running. This works well for tables that have more reads than writes, because the engine just keeps adding data at the end of the data file.
To enable concurrent inserts, set the concurrent_insert system variable. The values for the variable are:

· 0 -- Disables concurrent inserts
· 1 -- Enables concurrent insert for MyISAM tables that do not have holes. A hole means that there is free space in the data file, i.e., rows were deleted from the table. This is the default value.
· 2 -- Enables concurrent inserts for all MyISAM tables, even those that have holes. For a table with a hole,  new rows are inserted at the end of the table if it is in use by another thread. Otherwise, MySQL acquires a normal write lock and inserts the row into the hole.

To set the variable with other values (0 or 2) rather than the default value, for example concurrent_insert = 2, enter the parameter in the my.cnf file and restart the server. Or you can also dynamically set it in MySQL, for example:
   MySQL> set global concurrent_insert = 2;
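
To confirm the setting took effect:

   MySQL> show global variables like 'concurrent_insert';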

Here are a few things to keep in mind.

1. ) Under circumstances where concurrent inserts can be used, there is seldom any need to use the DELAYED modifier for INSERT statements.
2. ) If you are using the binary log, concurrent inserts are converted to normal inserts for CREATE ... SELECT or INSERT ... SELECT statements. This is done to ensure that you can re-create an exact copy of your tables by applying the log during a backup operation.
3. ) With LOAD DATA INFILE, if you specify CONCURRENT with a MyISAM table that satisfies the condition for concurrent inserts (that is, it contains no free blocks in the middle), other sessions can retrieve data from the table while LOAD DATA is executing.
4.) Use of the CONCURRENT option affects the performance of LOAD DATA a bit, even if no other session is using the table at the same time.
The MySQL documentation has more information on using concurrent inserts.
(5.) Using the bulk load command, LOAD DATA INFILE, to load data into a table from a text file can be 20 times faster than using INSERT statements.

Updating indexes takes time and resources during inserts. When data is loaded into a table that has many indexes, we can follow the steps below to speed up the load.

1.) Disable the indexes
Nonunique indexes of a MyISAM table can be disabled using the following statement:
MySQL> ALTER TABLE t1 DISABLE KEYS;
The ALTER TABLE ... DISABLE KEYS statement also does a flush of the table.
2.) Load data into the table using LOAD DATA INFILE statement. This does not update any indexes and therefore is very fast.
3.) Enable the indexes
Indexes can be enabled using the following statement:

MySQL> ALTER TABLE tbl_name ENABLE KEYS;

Here is a note on MySQL Disable and Enable Indexes Command.

http://databobjr.blogspot.com/2010/05/mysql-alter-table-disable-keys.html

You can also use myisamchk --keys-used=0 -rq /path/to/db/tbl_name to disable index keys and myisamchk -rq /path/to/db/tbl_name to enable them. However, if you use myisamchk to disable or enable index keys, you must issue a FLUSH TABLES statement or a mysqladmin flush-tables command. More information can be found here: http://dev.mysql.com/doc/refman/5.0/en/myisamchk.html
(6.) To increase performance for MyISAM tables, for both LOAD DATA INFILE and INSERT, enlarge the key cache by increasing the key_buffer_size system variable.
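
For example, a sketch; 256 MB is an arbitrary illustration that should be sized to the memory available on the server:

-- check the current key cache size
SHOW GLOBAL VARIABLES LIKE 'key_buffer_size';

-- enlarge the key cache for index-heavy loads
SET GLOBAL key_buffer_size = 256 * 1024 * 1024;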

3. Summary

The methods mentioned above are not the only ways to speed up inserts, nor should they be applied blindly in that order. It takes a lot of trial and error to find the combination that works for a particular environment.




Friday, July 30, 2010

IP To Country Lookup Table Using MySQL Spatial Index

I have an IP to Country lookup table that is used by various applications. The table is very simple:

CREATE TABLE `lu_ip_to_country` (
     id INT UNSIGNED NOT NULL auto_increment,
     start_ip_num int(10) UNSIGNED NOT NULL,
     end_ip_num int(10) UNSIGNED NOT NULL,
     country_id char(2) NOT NULL,
     iso_country_code varchar(3) DEFAULT NULL,
     country_name varchar(100) DEFAULT NULL,
     PRIMARY KEY (id),
     key start_end_ip (start_ip_num,end_ip_num)
);
  •  Start_ip_num: Beginning IP Number
  •  End_ip_num: Ending IP Number
  •  Country_id: two character country abbreviations (US, BG, FR, etc.) similar to ISO code
  •  ISO_country_code: ISO 3166 Country Code
  •  Country_name: Country Name

Beginning IP Number and Ending IP Number are the numeric versions of the IPs, which are calculated as follows:
       ip_number = 16777216*w + 65536*x + 256*y + z
where w, x, y, z are the four octets of the IP address (w.x.y.z). For example, if the IP is 61.94.7.255, then w=61, x=94, y=7, and z=255.
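
A quick sanity check of the formula in MySQL, comparing the manual arithmetic with the built-in INET_ATON() function:

SELECT 16777216*61 + 65536*94 + 256*7 + 255 AS manual_ip_num,
       INET_ATON('61.94.7.255') AS inet_aton_num;
-- both columns return 1029572607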

 The query used to look up country information is simple as well:
      Select country_id from lu_ip_to_country
      Where v_ip_num_lookup between start_ip_num and end_ip_num

In this query, v_ip_num_lookup is a variable that stores the calculated IP number; for example, '61.94.7.255' = 1029572607.

The other query utilizes the MySQL INET_ATON() function, which converts the IP to an IP number, i.e. INET_ATON('61.94.7.255') = 1029572607. The query is the following:

   Select country_id from lu_ip_to_country
   Where INET_ATON('61.94.7.255') between start_ip_num and end_ip_num

These two queries worked fine but slowly. It could take up to 1.5 seconds to return the result, and they were even slower when there were concurrent requests to this table. I had tried different indexes, and even tried using a memory table, but the improvements were not notable.

The other week, while searching the web, I came across an article written by Jeremy Cole that described how to use MySQL GIS with spatial R-tree indexes for a similar IP-to-country lookup in MySQL. I decided to give it a try, and it turned out that the results were excellent.

Jeremy Cole’s article gives a very detailed explanation of how GIS with spatial R-tree indexes works in this case. Here are the steps for how I implemented and tested it (following what was described in the article).

1. Reconstruct the table to use the POLYGON type and a spatial index

CREATE TABLE lu_ip_to_country_geoip (
     id INT UNSIGNED NOT NULL auto_increment,
     ip_poly POLYGON NOT NULL,
     start_ip_num int(10) UNSIGNED NOT NULL,
     end_ip_num int(10) UNSIGNED NOT NULL,
     country_id varchar(3) NOT NULL,
     iso_country_code varchar(3) DEFAULT NULL,
    country_name varchar(100) DEFAULT NULL,
    PRIMARY KEY (id),
    SPATIAL INDEX (ip_poly)
);

2. Export the data from the existing table
     Select start_ip_num,
                end_ip_num,
                country_id,
               iso_country_code,
              country_name
             Into outfile '/tmp/ip-country-data.dat'
     FIELDS
            TERMINATED BY ","
              ENCLOSED BY "\""
     LINES
           TERMINATED BY "\n"
      FROM lu_ip_to_country;

If you don’t have the data in the first place, you can download them from MaxMind web site (http://www.maxmind.com/app/country) as mentioned in the article.

3. Load the data into the new table with the MySQL GIS function to build the POLYGON for ip_poly field from the start_ip_num and end_ip_num fields.

    LOAD DATA LOCAL INFILE "/tmp/ip-country-data.dat"
         INTO TABLE lu_ip_to_country_geoip
         FIELDS
            TERMINATED BY ","
             ENCLOSED BY "\""
         LINES
              TERMINATED BY "\n"
           (
               @start_ip_num, @end_ip_num,
                @country_id,@iso_country_code, @country_name
             )
       SET
            id := NULL,
            start_ip_num := @start_ip_num,
           end_ip_num := @end_ip_num,
           ip_poly := GEOMFROMWKB(POLYGON(LINESTRING(
          /* clockwise, 4 points and back to 0 */
          POINT(@start_ip_num, -1), /* 0, top left */
          POINT(@end_ip_num, -1), /* 1, top right */
          POINT(@end_ip_num, 1), /* 2, bottom right */
          POINT(@start_ip_num, 1), /* 3, bottom left */
         POINT(@start_ip_num, -1) /* 0, back to start */))),
       country_id := @country_id,
       iso_country_code := @iso_country_code,
       country_name := @country_name;

4. Test queries

mysql> SELECT country_id,iso_country_code,country_name FROM lu_ip_to_country WHERE INET_ATON('61.94.7.255') BETWEEN start_ip_num AND end_ip_num;
+------------+------------------+--------------+
| country_id | iso_country_code | country_name |
+------------+------------------+--------------+
| ID | IDN | Indonesia |
+------------+------------------+--------------+
1 row in set (1.57 sec)

mysql> SELECT country_id,iso_country_code,country_name FROM lu_ip_to_country_geoip WHERE MBRCONTAINS(ip_poly, POINTFROMWKB(POINT(INET_ATON('61.94.7.255'), 0)));
+------------+------------------+--------------+
| country_id | iso_country_code | country_name |
+------------+------------------+--------------+
| ID | IDN | Indonesia |
+------------+------------------+--------------+
1 row in set (0.35 sec)

5. Perform A/B Test on ETL process.

I also selected a reasonably sized data file and ran it through the process, using the old IP-to-country table for process A and the new IP-to-country table for process B, and QA'ed the following:

  •  Check to make sure all Country IDs/Codes match exactly in both outputs
  • Compare run times
 Both processes produced the same country code lookups, except the new table made the same process run much faster.



Friday, May 21, 2010

MySQL: Optimize Table

MySQL Optimize Table Command

The OPTIMIZE TABLE command defragments a table, reclaims free space, and makes concurrent inserts work again. For a MyISAM table, the OPTIMIZE TABLE command performs the following tasks:
  • Repairs the table if the table has deleted or split rows. This, in turn, helps defragment and reclaim free space.
  • Sorts index pages if needed.
  • Updates the table’s statistics.
 One way to find out whether you should run OPTIMIZE TABLE on a table is to compare the Data_length and Data_free values from the SHOW TABLE STATUS command.

  •   Data_length is the length of the data file.
  •   Data_free is the number of allocated but unused bytes. This includes space from deleted records.
 When the Data_free value is relatively large compared to Data_length, OPTIMIZE TABLE should be performed against the table. Please see the following example:

    mysql> show table status like '%table_name%' \G;

*************************** 1. row ***************************

 Name: table_name
 Engine: MyISAM
 Version: 10
 Row_format: Dynamic
Rows: 1473839
Avg_row_length: 186
Data_length: 420392440
Max_data_length: 281474976710655
Index_length: 52712448
Data_free: 145156624
Auto_increment: NULL
 Create_time: 2010-05-18 18:35:16
 Update_time: 2010-05-20 13:44:47
 Check_time: 2010-05-20 02:21:54
 Collation: latin1_swedish_ci
 Checksum: NULL
 Create_options: delay_key_write=1
 Comment:

   mysql> optimize table table_name;

+--------------------+----------+----------+----------+
| Table              | Op       | Msg_type | Msg_text |
+--------------------+----------+----------+----------+
| dwstgdb.table_name | optimize | status   | OK       |
+--------------------+----------+----------+----------+
1 row in set (50.08 sec)

   mysql> show table status like '%table_name%' \G;

*************************** 1. row ***************************
 Name: table_name
 Engine: MyISAM
 Version: 10
 Row_format: Dynamic
 Rows: 1473839
 Avg_row_length: 186
 Data_length: 275235816
 Max_data_length: 281474976710655
 Index_length: 30075904
 Data_free: 0
 Auto_increment: NULL
 Create_time: 2010-05-18 18:35:16
 Update_time: 2010-05-21 00:33:58
 Check_time: 2010-05-21 00:34:14
 Collation: latin1_swedish_ci
 Checksum: NULL
 Create_options: delay_key_write=1
 Comment:

Now we can see that the Data_free value went down to 0 from 145156624 after performing OPTIMIZE TABLE. This defragments the table and speeds up queries. In general, the OPTIMIZE TABLE command should be run on a regular basis.
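
To find candidate tables across a whole schema instead of checking them one at a time, the same numbers can be pulled from information_schema. A sketch; the schema name below is a placeholder:

-- list MyISAM tables with the most reclaimable space
SELECT table_name, data_length, data_free
FROM information_schema.tables
WHERE table_schema = 'dwstgdb'
AND engine = 'MyISAM'
ORDER BY data_free DESC;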




MySQL: Alter Table Disable Keys

MySQL Disable Indexes Command

When inserting, updating, or deleting a high percentage of the data in a table, it is a best practice to disable indexes, especially when doing a bulk load into an empty table. To disable indexes, you can drop all the index keys and recreate them later. Or, you can use ALTER TABLE tablename DISABLE KEYS to deactivate the indexes on a MyISAM table.

To do that, before performing the SQL statements that involve insert, update, or delete, issue the following command:

         mysql> alter table tablename disable keys;

Now perform the tasks; for example:

        mysql> insert into tablename(...) values( ...);

After the task has been completed, issue the following command to reactivate the indexes:

         mysql>alter table tablename enable keys;

ALTER TABLE tablename DISABLE KEYS tells MySQL to stop updating nonunique indexes. ALTER TABLE tablename ENABLE KEYS tells MySQL to re-create missing indexes. MySQL does this with a special algorithm that is much faster than inserting keys one by one, so disabling keys before performing bulk insert operations should give a considerable speedup. When using ALTER TABLE tablename DISABLE KEYS, please keep in mind the following:

  •  ALTER TABLE tablename DISABLE KEYS will keep UNIQUE keys active.
  • Using ALTER TABLE tablename DISABLE KEYS requires ALTER, INSERT, CREATE and the INDEX privilege for that table.
  • While the nonunique indexes are disabled, they are ignored for statements such as SELECT and EXPLAIN that otherwise would use them.
  • ENABLE KEYS and DISABLE KEYS were not supported for partitioned tables prior to MySQL 5.1.11.


MySQL: Delay_Key_Write

MySQL Alter Table Delay_Key_Write Command

The delay_key_write option works with MyISAM tables. Delayed key writing tells the storage engine not to write changed index data to disk at the end of every query that changes the data. Instead, the changes are kept in the key buffer until the server prunes the buffer or closes the table. This option can boost performance if the MyISAM table changes frequently, for example, when it takes a lot of inserts and updates.

To turn on the delay_key_write option for a table, you can include the following clause in the create table script:

     Delay_key_write =1
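
For example, a minimal sketch; t1 and its columns are illustrative, not from the original post:

     -- the option is set when the table is created
     CREATE TABLE t1 (
       a INT NOT NULL,
       b VARCHAR(20),
       KEY idx_a (a)
     ) ENGINE=MyISAM DELAY_KEY_WRITE=1;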

Or for existing table, use the alter table statement:

    MySQL> ALTER TABLE tablename DELAY_KEY_WRITE = 1;

Please note that, in MySQL, the ALTER TABLE statement takes time to execute and locks the table, so for large tables, do this with care.

There are a few things to keep in mind when using the delay_key_write option as well. First, the changes, mainly the index changes, will be kept in memory until the buffer cache prunes them or the table is closed. A table is closed:
  •   When the cache is full and a thread tries to open a table that is not in the cache.
  •  When the cache contains more than table_cache entries and a thread is no longer using a table.
  •  When a FLUSH TABLES command is issued.
 This means that if the server or a table crashes, there will be index corruption, since the indexes in the index file (tablename.MYI) were not updated while the data file (tablename.MYD) was. Therefore, it is recommended that the myisam-recover option be turned on at server startup; for example, set
 --myisam-recover=BACKUP,FORCE in the my.cnf file or start the server with the --myisam-recover option. Secondly, if the table has many delayed writes, it will take longer for the table to close, since it needs time to flush all the cached changes to disk, and it will also take longer to run FLUSH TABLES statements (flush table, flush table with read lock). Hence, tables should be flushed a bit more frequently if you use the delay_key_write option. Lastly, queries may need to wait for the MyISAM engine to free up space in the key buffer cache, because the unflushed dirty blocks in the key buffer cache take up space.

The delay_key_write option can also be enabled globally for all MyISAM tables. To do so, set the delay_key_write option to 'all':

   delay_key_write = all

However, this is not common practice on most systems, because for infrequently written tables, delay_key_write may increase the chance of index corruption.
