Kimball: Star and Snowflake Schemas

What are the Star and Snowflake Schemas?

Introduction

In simple terms, both the star and snowflake schemas are a way of housing data in a structure that facilitates reporting, this is often referred to as a “datamart” and forms the central pillar of the Kimball paradigm.

A large data warehouse (OLTP / normalised database) might contain all the data a company wishes analyse, but quite often it is unsuitable for reporting due to its size and complexity. The Star and Snowflake schemas are often used to segregate a company’s data into manageable “pots”, these are usually owned by departments; finance, customer services, warehousing, etc.

There are two main reasons for this segregation:

  1. Setting up a “datamart” for a department rather than a company reduces the scope of the project.
  2. Different departments might want to see different things from their data.

Kimball’s Design: Star Schema

Ralph Kimball’s star schema is incredibly popular in the data warehousing world; the simplicity of the design can make reporting easy to build, small-medium sized datamarts can also be incredibly efficient to use and easy for a business to maintain.

StarSchema_Diagram

The fact table (center) is a combination of “facts” a user might be interested in; total sales value, date joined, etc. and gives a reference (commonly referred to as a surrogate key) for the related dimensions. Each dimensional key residing in the fact table can be linked multiple times, but it must relate to one and only one key in the associated dimension.

Case Study:

We will take a very simple case to build our study. A recruitment company wishes to build a new datamart for their candidate base; they wish to use this data to build a report that gives a listing of anyone with a specified occupation. They are only interested in storing the location (address) of their candidates and his / her occupation in a database, there are no further requirements at this time.

A star schema for those relations might look something like this:

StarSchema_Demo

The address is split out from the candidate name; two people could have the same address, likewise the occupation would also become a separate dimension (a candidate could have several occupations). Since data relating to the occupation, address and name details are held in dimensions and referenced by a key, we are effectively reducing the amount of overall data (redundancy) held within the database, but we are not losing access to the information.

Kimball’s Design: Snowflake Schema

The principle behind a Snowflake schema is exactly the same as a star schema; there is always a central fact table, but the associated dimensions can be multi-layered. When properly utilised, the performance of a large data warehouse can be significantly improved by moving to a snowflake schema.

StarSchema_Diagram2

Case Study:

To create a snowflake, we will build on the star schema example from earlier; a new requirement has come in, and the recruitment company now want to hold details of the address type, if it is a residential or business. They have also asked that their data be divided into regions, as that will allow their reporting to show candidates more suitable to their customer needs. Addresses are comprised of multiple elements, and some of those are recurring; towns, counties, postcodes, etc.

A star schema could easily support these new requirements, but by splitting our address regions into a sub-dimension, we can utilise a snowflake schema to reduce the data a little more.

StarSchema_Demo2

We have moved the region details into a new sub-dimension, and the address dimension now has a key to relate to our newly formed sub-dimension. Likewise, the requirement of storing the address type exists within a new sub-dimension, and again is related back to the address.

Conclusion:

As always, appropriate planning and requirement gathering stages are fundamental to the design process.

Snowflake schemas can often become overly complex if not designed and implemented properly, and could damage user confidence. Likewise, overly large star schemas can be slow to query, and that could cause frustration fro the end users towards the data project.

In breaking out a design from a star to a snowflake it is important to remember that while mathematically it might seem significantly more efficient, this is not meant to be an exercise in normal form; the business users are effectively the stakeholders and the design not only has to be able to service their needs, it has to make sense to those that use it.

Go Back

If you have a question, please use the contact form below to get in touch: