1. Introduction:
Over the years, I had done a few data warehouse projects and read a few data warehouse and business intelligence books and white pages. However, I had never really summarized the concepts, terminologies, and techniques that I had learned. Here, I will start to put together the basic concepts, and terminologies that are frequently referred to and mythologies and techniques that are frequently used and talked about.
I am going to organize the topic base on the following data warehouse concepts. In this note, I will provide summarize version of the concepts. In the future note, I will provide more explanation and example of the each concept.
- Data warehouse and data mart
- Fact and Dimension
- Star Schema and Snowflake Schema
- Extract, Transform and Load (ETL)
- Meta Data
- Cube
2.1 Data Warehouse
A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. The goal of the data warehouse is make enterprise data easily accessible for strategic decision making.
There are two well-known authors on data warehousing: Bill Inmon and Ralph Kimball. Following are definitions from each of them:
- Subject-oriented - Data that gives information about a particular subject instead of about a company's on-going operations.
- Integrated - Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- Time-variant - All data in the data warehouse is identified with a particular time period.
- Non-volatile - Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
2.2 Data Mart
A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse or be more specialized. The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease-of-use. Users of a data mart can expect to have data presented in terms that are familiar.
2.3 Data Warehouse vs. Data Mart:
- A data warehouse tends to be a strategic but somewhat unfinished concept. The design of a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that the data can later be used. A data warehouse is a central aggregation of data (which can be distributed physically);
- A data mart tends to be tactical and aimed at meeting an immediate need. The design of a data mart tends to start from an analysis of user needs. A data mart is a data repository that may derive from a data warehouse or not and that emphasizes ease of access and usability for a particular designed purpose.
3.1 Facts
A fact table is the central table in a star join schema characterized by a composite key, each of whose elements is a foreign key drawn from a dimension table.
- Represent a business process, i.e., models the business process as an artifact in the data model
- Contain the measurements or metrics or facts of business processes.
- Most are additive (sales this month), some are semi-additive (balance as of), some are not additive (unit price).
- The level of detail is called the “grain” of the table.
- Contain foreign keys for the dimension tables include time dimension.
- Reolve many-to-many relationships
- Represent the who, what, where, when and how of a measurement/artifact.
- Represent real-world entities not business processes.
- Give the context of a measurement (subject).
- For example for the Sales fact table, the characteristics of the 'monthly sales number' measurement can be a Location (Where), Time (When), Product Sold (What).
- The Dimension Attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, Zip code. Generally the Dimension Attributes are used in report labels, and query constraints such as where Country='USA'. The dimension attributes also contain one or more hierarchical relationships.
- Hierarchical relationships.
4. Star Schemas and Snowflake Schemas
4.1 Star SchemasA start Schema (Dimensional Model) consists a single fact table of metrics surrounded by multiple descriptive dimension tables. A start schema will consist at less on fact table and a few dimensions tables.
A snowflake schema is a set of tables comprised of a single, central fact table surrounded by normalized dimensions. The main different between a star schema and a snowflake schema is that star schema tends to have demoralize dimensions while a snowflake schema prefers normalized dimensions.
5. Extract, Transform and Load (ETL)
Extract, Transform and Load (ETL) is the name for a process or a group of processes that transport data from source system into the data warehouse or data mart.- Extract is the fist part of an ETL process. It involves extracting the data from different source systems and different data formats.
- Transform refers to Appling business rules and logics to the data that was extracted from the source system before loading them into the data warehouse. Data quality assurance can also be performed here.
- Load processes load data, after the transform processes, to the data warehouse and data mart which some time also refer to as targets.
Meta data literally means "data about data.". It describes the characteristics of a resource. describes how and when and by whom a particular set of data was collected, and how the data is formatted. Metadata is essential for understanding information stored in data warehouses.
- A data dictionary is a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format.
A cube contains dimensions, hierarchies, levels, and measures. Each individual point in a cube is referred to as a cell.
Cubes are OLAP objects consisting of related measures and dimensions that you configure within an Analysis Services database. You can define and configure multiple cubes within a single database and each cube can use some or all of the same dimensions. You can also define a single cube that contains multiple measure groups in the same database rather than defining separate cubes. When you define a cube with multiple measure groups, you need to define how dimensions relate to each measure group and customize, as appropriate, dimension objects within each cube and measure group. When defining a cube, you also define advance cube properties, including calculations, KPIs, actions, partitions and aggregations, perspectives and translations.
8. Smmary
The concepts which described above were the basic concepts would start with any data were house project. Understand those concepts provide a starting point in any data warehouse development. Each of the concepts mentioned above represent an important part in the data warehouse development life cycle. These concepts are fundamental to design and implement a data warehouse. Following diagram show each of the concepts tied into in each data warehouse layers.References:
W. H. Inmon: Building The Data Warehouse ( 2005)
Rallph Kimball and Margy Ross: The Data Warehouse Toolkit (Second Edition)
1 comment:
An outstanding share! I have just forwarded this onto a coworker who has been conducting
a little research on this. And he in fact bought me dinner due to the fact that I discovered it
for him... lol. So let me reword this.... Thanks for the meal!!
But yeah, thanks for spending the time to talk about this topic here on your web
site.
Post a Comment