Dimensional Data Modeling

dimensional-data-modeling

Dimensional Data Modeling

What is dimensional data modeling?

What are the objectives of Data Warehouse Modeling?

The objectives, laid out by Ross and Kimball, are straightforward:

  • present information consistently
  • adaptable and receptive to change
  • present information in a timely way
  • protect information assets
  • serve as an authoritative and trustworthy foundation for improved decision making (single source of truth in Data Engineering language)
  • the VIPs must accept your system

Why Modeling?

As a data engineer, you know SQL very well and can probably write SQL queries for all day long. But you cannot assume that the typical end-user will be an expert on writing SQL queries. So, our objective is to build a DW so easy for analysts to write analysis queries quickly and effectively.

Things you would not want your analysts to do:

  • Queries based on ID
  • Cascading outer joins(even you would not want to do that)
  • grouped or joined multiple subqueries
  • recursive subqueries(just visit Hackerrank SQL and you would understand the pain)
  • Subquery correlation: fetching data across multiple columns in different subqueries
  • Joins without PK/FK: even for me(1.5 years of DE experience), it’s hard to visualize.

Things you could and should expect from your analysts:

  • simple joins
  • columns with names and comprehensive text
  • simple aggregation
  • analytical windowed functions
  • DISTINCT

Transitioning from transactional databases to facts

Most businesses measure their success and proficiency by measuring certain types of data. This data captures real business activities and progress. This data is called as facts.

The OLTP arranged databases record transactions at a time, kind of like event streaming however centralized around transactions. The DW is unique. DW does not have to record details at the transactional level. DW needs to have facts across various criteria of your business. DW needs to aggregate(or let analysts aggregate) the information needed to improve business. And so, redundancy is an unforgivable sin in DW.

What are measures and why should you fill your fact tables with them?

Maintaining Consistent Grain

Why do you need to maintain a consistent grain?

So that you can ensure that your system can legitimately correlate and aggregate across facts.

  • Accumulating Snapshot Fact Tables

Periodic Snapshot Fact Tables

As the name suggests, they are gathered at regular time intervals. Consumption of gas, audit, and inspections are some instances of data assortment that have occasional snapshots enabled for them.

Accumulating Snapshot Fact Tables

At the point when a business performance indicator is a rate of completing a multi-step business process, you may want to capture at the grain of the sum of the process and record the start, the culmination, and the steps in between. This may be transaction-grained, however it has a great deal of measures in between. So, you use accumulating snapshot fact tables to answer complex questions in business intelligence where there is the passing of time between facts. One genuine example would be a fact table column of your ordering a chicken sandwich and the fact table line of the bag being handed through the window of the car at McDonald’s drive-through.

Locating Facts via Dimensions

You and your analysts need to know how to query and filter facts to derive business intelligence from them. This purpose is served by dimensions.

Drawing Dimensions from Reference and Raw-Meta data

Dimensions are almost always created with a surrogate key; the surrogate key, naturally, is referred to by the foreign key (or keys) in the fact table. We search the table by searching the dimensions in which we are interested. All of the other data describing our facts, such as timestamps, customer agents, store location, item, and customer are what we transform into dimensions.

The beauty of dimensional modeling is that facts are not defined by the primary keys or any sort of special identifier, instead, they are defined by the combination of dimensions. This gives rise to Star Schema.

Image for post

Hierarchy in Dimensions

Consider the following two images.

Image for post

[3]

Image for post

[3]

Image for post

[3]

Slowly Changing Dimensions

As much as I would like to write about it, I still think that it’s better for my readers to understand this concept thoroughly from here.

Integrate your Big Data Into Your ETL System

You will treat your tabular Big Data as having been acquired through one of your standard Extract phases. Thus, you will apply to it the same steps you did in transform:

  1. Conforming units and formats
  2. De-duplication
  3. Restructuring
  4. Staging

Summary

I wanted to understand the theoretical aspects of database design which led me to read the book, Ross and Kimball. I then became curious to draw differences and analogies in their methods and that of today’s leading data-driven companies like Netflix, Airbnb, Uber, etc.