Saturday, August 07, 2010

DATA WAREHOUSE CONCEPTS

Data Warehouse Concepts

1. Introduction:

Over the years, I have done a few data warehouse projects and read a number of data warehouse and business intelligence books and white papers. However, I had never really summarized the concepts, terminologies, and techniques that I had learned. Here, I will start to put together the basic concepts and terminologies that are frequently referred to, and the methodologies and techniques that are frequently used and talked about.

I am going to organize the topic based on the following data warehouse concepts. In this note, I will provide a summarized version of each concept. In future notes, I will provide more explanation and examples of each concept.

  • Data warehouse and data mart
  • Fact and Dimension
  • Star Schema and Snowflake Schema
  • Extract, Transform and Load (ETL)
  • Meta Data
  • Cube
2. Data Warehouse and Data Mart
2.1 Data Warehouse

A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. The goal of the data warehouse is to make enterprise data easily accessible for strategic decision making.

There are two well-known authors on data warehousing: Bill Inmon and Ralph Kimball. Following are definitions from each of them:


  • Bill Inmon: A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions.




    • Subject-oriented - Data that gives information about a particular subject instead of about a company's on-going operations.
    • Integrated - Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
    • Time-variant - All data in the data warehouse is identified with a particular time period.
    • Non-volatile - Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.


  • Ralph Kimball: The conglomeration of an organization’s data warehouse staging and presentation areas, where operational data is specifically structured for query and analysis performance and ease-of-use.




  2.2 Data Mart
     A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse or be more specialized. The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease-of-use. Users of a data mart can expect to have data presented in terms that are familiar.

    2.3 Data Warehouse vs. Data Mart:
    • A data warehouse tends to be a strategic but somewhat unfinished concept. The design of a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that the data can later be used. A data warehouse is a central aggregation of data (which can be distributed physically).
    • A data mart tends to be tactical and aimed at meeting an immediate need. The design of a data mart tends to start from an analysis of user needs. A data mart is a data repository that may or may not derive from a data warehouse, and that emphasizes ease of access and usability for a particular designed purpose.
    3. Facts and Dimensions
    3.1 Facts

    A fact table is the central table in a star join schema characterized by a composite key, each of whose elements is a foreign key drawn from a dimension table.
    • Represent a business process, i.e., models the business process as an artifact in the data model 
    • Contain the measurements or metrics or facts of business processes. 
    • Most are additive (sales this month), some are semi-additive (balance as of), some are not additive (unit price). 
    • The level of detail is called the “grain” of the table. 
    • Contain foreign keys to the dimension tables, including the time dimension.
    • Resolve many-to-many relationships between dimensions.
    3.2 Dimensions
    • Represent the who, what, where, when and how of a measurement/artifact. 
    • Represent real-world entities not business processes. 
    • Give the context of a measurement (subject). 
    • For example, for a Sales fact table, the characteristics of the 'monthly sales number' measurement can be a Location (where), Time (when), and Product Sold (what).
    • The dimension attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, and Zip Code. Generally, the dimension attributes are used in report labels and in query constraints such as WHERE Country='USA'. The dimension attributes also contain one or more hierarchical relationships (for example, Country > State > Zip Code).
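As a concrete sketch of these ideas (all table and column names here are illustrative, not from any real system), a monthly sales fact table with two of its dimensions might look like:

```sql
-- Hypothetical dimension tables: each row is a real-world entity (who/what/where).
CREATE TABLE dim_location (
    location_key INT UNSIGNED NOT NULL AUTO_INCREMENT,
    location_code VARCHAR(10),
    state VARCHAR(50),
    country VARCHAR(50),
    zip_code VARCHAR(10),
    PRIMARY KEY (location_key)
);

CREATE TABLE dim_product (
    product_key INT UNSIGNED NOT NULL AUTO_INCREMENT,
    product_name VARCHAR(100),
    category VARCHAR(50),
    PRIMARY KEY (product_key)
);

-- Fact table: a composite key made of foreign keys to the dimensions
-- (including time). The grain is one row per product, per location, per month.
CREATE TABLE fact_sales (
    time_key INT UNSIGNED NOT NULL,      -- FK to a time dimension (when)
    location_key INT UNSIGNED NOT NULL,  -- FK to dim_location (where)
    product_key INT UNSIGNED NOT NULL,   -- FK to dim_product (what)
    sales_amount DECIMAL(12,2),          -- additive measure
    unit_price DECIMAL(8,2),             -- non-additive measure
    PRIMARY KEY (time_key, location_key, product_key)
);
```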
    4. Star Schemas and Snowflake Schemas
    4.1 Star Schemas

    A star schema (dimensional model) consists of a single fact table of metrics surrounded by multiple descriptive dimension tables. A star schema will consist of at least one fact table and a few dimension tables.
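A typical query against a star schema is a "star join": the central fact table joined to its dimension tables, with the measures aggregated by dimension attributes. A sketch, using hypothetical fact_sales, dim_location, and dim_time tables:

```sql
-- Hypothetical star join: monthly sales totals for one country.
SELECT d.country,
       t.year,
       t.month,
       SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_location d ON f.location_key = d.location_key
JOIN dim_time t     ON f.time_key     = t.time_key
WHERE d.country = 'USA'
GROUP BY d.country, t.year, t.month;
```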

      
    4.2 Snowflake Schema

    A snowflake schema is a set of tables comprised of a single, central fact table surrounded by normalized dimensions. The main difference between a star schema and a snowflake schema is that a star schema tends to have denormalized dimensions, while a snowflake schema prefers normalized dimensions.
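For example (illustrative names), a denormalized star-schema Location dimension repeats state and country text on every row, while the snowflaked version normalizes them into their own tables:

```sql
-- Snowflaked Location dimension: state and country are pulled out into
-- their own tables instead of being repeated on every location row.
CREATE TABLE dim_country (
    country_key INT UNSIGNED NOT NULL AUTO_INCREMENT,
    country_name VARCHAR(50),
    PRIMARY KEY (country_key)
);

CREATE TABLE dim_state (
    state_key INT UNSIGNED NOT NULL AUTO_INCREMENT,
    state_name VARCHAR(50),
    country_key INT UNSIGNED NOT NULL,   -- FK to dim_country
    PRIMARY KEY (state_key)
);

CREATE TABLE dim_location_snowflake (
    location_key INT UNSIGNED NOT NULL AUTO_INCREMENT,
    location_code VARCHAR(10),
    zip_code VARCHAR(10),
    state_key INT UNSIGNED NOT NULL,     -- FK to dim_state
    PRIMARY KEY (location_key)
);
```

The trade-off is that queries now need one extra join per normalized level of the hierarchy.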


    5. Extract, Transform and Load (ETL)
    Extract, Transform and Load (ETL) is the name for a process, or a group of processes, that transports data from source systems into the data warehouse or data mart.
    • Extract is the first part of an ETL process. It involves extracting the data from different source systems and different data formats.
    • Transform refers to applying business rules and logic to the data that was extracted from the source systems before loading it into the data warehouse. Data quality assurance can also be performed here.
    • Load processes load the data, after the transform processes, into the data warehouse and data marts, which are sometimes referred to as targets.
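In SQL terms, a very simplified transform-and-load step (the staging table and business rules below are hypothetical) might look like:

```sql
-- Hypothetical T and L steps: rows already extracted from a source system
-- into a staging table are cleaned and loaded into the warehouse target.
INSERT INTO dw_fact_sales (time_key, product_key, sales_amount)
SELECT s.time_key,
       s.product_key,
       ROUND(s.sales_amount, 2)      -- transform: normalize precision
FROM staging_sales s
WHERE s.sales_amount IS NOT NULL     -- data quality: drop incomplete rows
  AND s.sales_amount >= 0;           -- data quality: reject invalid amounts
```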
    6. Meta Data

    Metadata literally means "data about data." It describes the characteristics of a resource: how, when, and by whom a particular set of data was collected, and how the data is formatted. Metadata is essential for understanding information stored in data warehouses.
    • A data dictionary is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format."
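In MySQL, for example, the INFORMATION_SCHEMA tables act as a built-in data dictionary, holding metadata about the tables themselves (the schema and table names below are hypothetical):

```sql
-- Querying MySQL's data dictionary: column names, types, and nullability
-- for a hypothetical fact_sales table in a "dw" schema.
SELECT COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dw'
  AND TABLE_NAME = 'fact_sales'
ORDER BY ORDINAL_POSITION;
```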
    7. Cube

    A cube contains dimensions, hierarchies, levels, and measures. Each individual point in a cube is referred to as a cell.

    Cubes are OLAP objects consisting of related measures and dimensions that you configure within an Analysis Services database. You can define and configure multiple cubes within a single database, and each cube can use some or all of the same dimensions. You can also define a single cube that contains multiple measure groups in the same database rather than defining separate cubes. When you define a cube with multiple measure groups, you need to define how dimensions relate to each measure group and customize, as appropriate, dimension objects within each cube and measure group. When defining a cube, you also define advanced cube properties, including calculations, KPIs, actions, partitions and aggregations, perspectives and translations.
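Analysis Services cubes are defined in BI tooling rather than in plain SQL, but the core idea, pre-computing aggregates at every level of a dimension hierarchy, can be roughly approximated in MySQL with GROUP BY ... WITH ROLLUP (table names are the same hypothetical ones used earlier):

```sql
-- Rough relational analogue of cube aggregation: WITH ROLLUP adds
-- subtotal rows for each level of the country > state hierarchy,
-- plus a grand-total row (NULLs mark the rolled-up levels).
SELECT d.country,
       d.state,
       SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_location d ON f.location_key = d.location_key
GROUP BY d.country, d.state WITH ROLLUP;
```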

    8. Summary
    The concepts described above are the basic concepts you would start with on any data warehouse project. Understanding them provides a starting point for any data warehouse development. Each of the concepts mentioned above represents an important part of the data warehouse development life cycle, and they are fundamental to designing and implementing a data warehouse. The following diagram shows how each of the concepts ties into each data warehouse layer.



    Friday, July 30, 2010

    IP To Country Lookup Table Using MySQL Spatial Index

    I have an IP to Country lookup table that is used by various applications. The table is very simple:

    CREATE TABLE `lu_ip_to_country` (
         id INT UNSIGNED NOT NULL auto_increment,
         start_ip_num int(10) UNSIGNED NOT NULL,
         end_ip_num int(10) UNSIGNED NOT NULL,
         country_id char(2) NOT NULL,
         iso_country_code varchar(3) DEFAULT NULL,
         country_name varchar(100) DEFAULT NULL,
         PRIMARY KEY (id),
         key start_end_ip (start_ip_num,end_ip_num)
    );
    • start_ip_num: beginning IP number
    • end_ip_num: ending IP number
    • country_id: two-character country abbreviation (US, BG, FR, etc.), similar to the ISO code
    • iso_country_code: ISO 3166 country code
    • country_name: country name

    The beginning IP number and ending IP number are the numeric versions of the IPs, calculated as follows:
           ip number = 16777216*w + 65536*x + 256*y + z
    where w, x, y, z are the four octets of the IP address (w.x.y.z). For example: if the IP is 61.94.7.255, then w=61, x=94, y=7 and z=255.
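The formula can be checked directly in MySQL against the built-in INET_ATON() function; both expressions below yield the same number:

```sql
-- IP number for 61.94.7.255 computed by hand and via INET_ATON():
SELECT 16777216*61 + 65536*94 + 256*7 + 255 AS manual_ip_num,
       INET_ATON('61.94.7.255')             AS inet_aton_ip_num;
-- both columns return 1029572607
```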

     The queries used to look up country information are simple as well:
          SELECT country_id FROM lu_ip_to_country
          WHERE v_ip_num_lookup BETWEEN start_ip_num AND end_ip_num;

    In this query, v_ip_num_lookup is a variable used to store the calculated IP number. For example, '61.94.7.255' = 1029572607.

    The other query utilizes the MySQL INET_ATON() function, which converts the IP to an IP number, i.e. INET_ATON('61.94.7.255') = 1029572607. The query is the following:

       SELECT country_id FROM lu_ip_to_country
       WHERE INET_ATON('61.94.7.255') BETWEEN start_ip_num AND end_ip_num;

    These two queries worked, but they were slow: it could take up to 1.5 seconds to return a result, and they were even slower when there were concurrent requests to this table. I had tried different indexes, and even tried to use a MEMORY table, but the improvements were not notable.

    The other week, while I was searching the web, I came across an article by Jeremy Cole describing how to use MySQL GIS with spatial R-tree indexes for a similar IP-to-country lookup in MySQL. I decided to give it a try, and it turned out that the results were excellent.

    Jeremy Cole's article gives a very detailed explanation of how GIS with spatial R-tree indexes works in this case. Here are the steps for how I implemented and tested it (following what was described in the article).

    1. Reconstruct the table to use the POLYGON type and a spatial index

    CREATE TABLE lu_ip_to_country_geoip (
         id INT UNSIGNED NOT NULL auto_increment,
         ip_poly POLYGON NOT NULL,
         start_ip_num int(10) UNSIGNED NOT NULL,
         end_ip_num int(10) UNSIGNED NOT NULL,
         country_id varchar(3) NOT NULL,
         iso_country_code varchar(3) DEFAULT NULL,
         country_name varchar(100) DEFAULT NULL,
         PRIMARY KEY (id),
         SPATIAL INDEX (ip_poly)
    );

    2. Export the data from the existing table
         SELECT start_ip_num,
                end_ip_num,
                country_id,
                iso_country_code,
                country_name
         INTO OUTFILE '/tmp/ip-country-data.dat'
         FIELDS
                TERMINATED BY ","
                ENCLOSED BY "\""
         LINES
                TERMINATED BY "\n"
         FROM lu_ip_to_country;

    If you don’t have the data in the first place, you can download them from MaxMind web site (http://www.maxmind.com/app/country) as mentioned in the article.

    3. Load the data into the new table with the MySQL GIS function to build the POLYGON for ip_poly field from the start_ip_num and end_ip_num fields.

        LOAD DATA LOCAL INFILE "/tmp/ip-country-data.dat"
             INTO TABLE lu_ip_to_country_geoip
             FIELDS
                  TERMINATED BY ","
                  ENCLOSED BY "\""
             LINES
                  TERMINATED BY "\n"
             (
                  @start_ip_num, @end_ip_num,
                  @country_id, @iso_country_code, @country_name
             )
        SET
             id := NULL,
             start_ip_num := @start_ip_num,
             end_ip_num := @end_ip_num,
             ip_poly := GEOMFROMWKB(POLYGON(LINESTRING(
                  /* clockwise, 4 points and back to 0 */
                  POINT(@start_ip_num, -1), /* 0, top left */
                  POINT(@end_ip_num, -1),   /* 1, top right */
                  POINT(@end_ip_num, 1),    /* 2, bottom right */
                  POINT(@start_ip_num, 1),  /* 3, bottom left */
                  POINT(@start_ip_num, -1)  /* 0, back to start */))),
             country_id := @country_id,
             iso_country_code := @iso_country_code,
             country_name := @country_name;

    4. Test queries

    mysql> SELECT country_id,iso_country_code,country_name FROM lu_ip_to_country WHERE INET_ATON('61.94.7.255') BETWEEN start_ip_num AND end_ip_num;
    +------------+------------------+--------------+
    | country_id | iso_country_code | country_name |
    +------------+------------------+--------------+
    | ID | IDN | Indonesia |
    +------------+------------------+--------------+
    1 row in set (1.57 sec)

    mysql> SELECT country_id,iso_country_code,country_name FROM lu_ip_to_country_geoip WHERE MBRCONTAINS(ip_poly, POINTFROMWKB(POINT(INET_ATON('61.94.7.255'), 0)));
    +------------+------------------+--------------+
    | country_id | iso_country_code | country_name |
    +------------+------------------+--------------+
    | ID | IDN | Indonesia |
    +------------+------------------+--------------+
    1 row in set (0.35 sec)

    5. Perform an A/B test on an ETL process.

    I also selected a reasonable data file and ran it through the process, using the old IP-to-country table for process A and the new IP-to-country table for process B, and QA'ed the following:

    • Check to make sure all country IDs/codes match exactly in both outputs
    • Compare run times

    Both processes produced the same country code lookups, and the new table made the same process run much faster.



    Thursday, June 03, 2010

    Mouse Tricks - Point and Click

    What would have happened if he had used a touch screen instead of a mouse at work?