Dimensional Data Modeling
What is dimensional data modeling?
Dimensional modeling (DM) is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball, which includes a set of methods, techniques, and concepts for use in data warehouse design. The approach focuses on identifying the key business processes within a business and modeling and implementing these first, before adding further business processes: a bottom-up approach.
What are the objectives of Data Warehouse Modeling?
The objectives, laid out by Ross and Kimball, are straightforward:
- make information easily accessible
- present information consistently
- adapt and be receptive to change
- present information in a timely way
- protect information assets
- serve as an authoritative and trustworthy foundation for improved decision making (single source of truth in Data Engineering language)
- win acceptance from the business community (the VIPs must accept your system)
If you have worked on or used an ETL system, you will have noticed that information consistency is achieved by conforming measures, timeliness is provided by ETL cycles, and adaptability also depends largely on the ETL design.
As a data engineer, you know SQL very well and can probably write SQL queries all day long. But you cannot assume that the typical end user is an expert at writing SQL queries. So our objective is to build a DW that makes it easy for analysts to write analysis queries quickly and effectively.
Things you would not want your analysts to do:
- queries based on IDs
- cascading outer joins (even you would not want to do that)
- grouping or joining multiple subqueries
- recursive subqueries (just visit HackerRank SQL and you will understand the pain)
- subquery correlation: fetching data across multiple columns in different subqueries
- joins without PK/FK: even for me (1.5 years of DE experience), it's hard to visualize
Things you could and should expect from your analysts:
- simple joins
- columns with comprehensible names and descriptive text
- simple aggregation
- analytical windowed functions
Note that the queries above are not necessarily cheap, and your system should be scalable enough to handle all of them.
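To make the contrast concrete, here is a minimal sketch (using Python's sqlite3 with made-up table and column names) of the kind of query we *do* want analysts writing against a star schema: one simple join and one aggregation, no IDs or nested subqueries in sight.

```python
import sqlite3

# Hypothetical star-schema tables; all names here are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT);
CREATE TABLE fact_sales (store_key INTEGER REFERENCES dim_store(store_key),
                         sales_amount REAL);
INSERT INTO dim_store VALUES (1, 'Downtown'), (2, 'Airport');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# The kind of query analysts should be writing: one join, one aggregate,
# filtering and grouping on readable dimension attributes, not IDs.
rows = con.execute("""
    SELECT s.store_name, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY s.store_name
    ORDER BY s.store_name
""").fetchall()
print(rows)  # [('Airport', 75.0), ('Downtown', 150.0)]
```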
To meet the objectives above, OLTP databases are transformed into facts and dimensions.
Transitioning from transactional databases to facts
Most businesses measure their success and proficiency by measuring certain types of data. This data captures real business activities and progress, and it is called facts.
OLTP databases record one transaction at a time, somewhat like event streaming but centered on transactions. The DW is different: it does not have to record details at the transactional level. It needs facts across the various criteria of your business, and it needs to aggregate (or let analysts aggregate) the information needed to improve the business. And so, redundancy is an unforgivable sin in a DW.
What are measures and why should you fill your fact tables with them?
In a data warehouse, a measure is a property on which calculations can be made.
The facts we derive from the operational data stores come with some additional data which is typically summed in our analysis. These are the aspects of a fact that allow the analyst, or the executive viewing the analysis, to see the value in the fact.
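As a small sketch of what "typically summed" means in practice, here is a fact table (table and column names are my own, for illustration) where each row carries additive measures that roll up cleanly in analysis:

```python
import sqlite3

# Each fact row carries measures (quantity, extended_amount) that can be
# summed across any slice of the data. All names here are illustrative.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fact_order_line (
    order_id INTEGER, product_key INTEGER,
    quantity INTEGER, extended_amount REAL)""")
con.executemany("INSERT INTO fact_order_line VALUES (?,?,?,?)",
                [(1, 10, 2, 19.98), (1, 11, 1, 4.50), (2, 10, 3, 29.97)])

# Additive measures roll up with a plain SUM; this is what makes them
# valuable to the analyst or executive viewing the analysis.
qty, amount = con.execute(
    "SELECT SUM(quantity), SUM(extended_amount) FROM fact_order_line").fetchone()
print(qty, round(amount, 2))  # 6 54.45
```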
Maintaining Consistent Grain
Why do you need to maintain a consistent grain?
So that you can ensure that your system can legitimately correlate and aggregate across facts.
But it's not always possible to have data at atomic levels. So, to bridge this gap, there are two kinds of fact table:
- Periodic Snapshot Fact Tables
- Accumulating Snapshot Fact Tables
Periodic Snapshot Fact Tables
As the name suggests, these are captured at regular time intervals. Gas consumption, audits, and inspections are typical examples of data collected as periodic snapshots.
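A minimal sketch of a periodic snapshot (the meter/month names are assumptions of mine): one row per meter per period, regardless of how many underlying events occurred, which then rolls up by period.

```python
import sqlite3

# Periodic snapshot fact table: one row per meter per month.
# Table and column names are illustrative, not from the post.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fact_gas_monthly (
    month_key TEXT,       -- the snapshot period, e.g. '2021-01'
    meter_key INTEGER,
    consumption_m3 REAL   -- usage during the period
)""")
con.executemany("INSERT INTO fact_gas_monthly VALUES (?,?,?)",
                [("2021-01", 1, 120.0), ("2021-02", 1, 95.0), ("2021-01", 2, 60.0)])

# Snapshots aggregate naturally by period.
rows = con.execute("""SELECT month_key, SUM(consumption_m3)
                      FROM fact_gas_monthly
                      GROUP BY month_key ORDER BY month_key""").fetchall()
print(rows)  # [('2021-01', 180.0), ('2021-02', 95.0)]
```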
Accumulating Snapshot Fact Tables
When a business performance indicator is the rate of completion of a multi-step business process, you may want to capture the process as a whole, recording the start, the completion, and the steps in between. This may be transaction-grained, but it carries a great deal of measures along the way. So, you use accumulating snapshot fact tables to answer complex business-intelligence questions where time passes between facts. One real-world example: ordering a chicken sandwich at a McDonald's drive-through, where the fact row records both the moment you order and the moment the bag is handed through the window of the car.
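The drive-through example can be sketched as follows (milestone names are my own): one row per order, with a column per milestone, and the row is updated in place as the process advances.

```python
import sqlite3

# Accumulating snapshot: one row per order, one column per milestone.
# All names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fact_order_fulfilment (
    order_key INTEGER PRIMARY KEY,
    ordered_at TEXT, prepared_at TEXT, handed_over_at TEXT)""")
con.execute("INSERT INTO fact_order_fulfilment (order_key, ordered_at) "
            "VALUES (1, '12:00:00')")

# Later milestones UPDATE the same row as they happen.
con.execute("UPDATE fact_order_fulfilment SET prepared_at='12:03:00' WHERE order_key=1")
con.execute("UPDATE fact_order_fulfilment SET handed_over_at='12:05:30' WHERE order_key=1")

# The elapsed time between milestones is the interesting measure.
secs = con.execute("""SELECT strftime('%s', handed_over_at) - strftime('%s', ordered_at)
                      FROM fact_order_fulfilment WHERE order_key=1""").fetchone()[0]
print(secs)  # 330
```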
Locating Facts via Dimensions
You and your analysts need to know how to query and filter facts to derive business intelligence from them. This purpose is served by dimensions.
Drawing Dimensions from Reference and Raw-Meta data
Dimensions are almost always created with a surrogate key; that surrogate key, naturally, is referred to by the foreign key (or keys) in the fact table. We locate facts by filtering on the dimensions we are interested in. All of the other data describing our facts, such as timestamps, customer agents, store location, item, and customer, is what we transform into dimensions.
The beauty of dimensional modeling is that facts are not defined by primary keys or any sort of special identifier; instead, they are defined by their combination of dimensions. This gives rise to the Star Schema.
It is very important that our dimensions be unique. When we get to queries across facts, duplicates among dimension combinations will turn into a disaster. If you can't keep them unique, add or aggregate dimensions to make 'em unique.
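A small sketch of both points (surrogate keys, and facts defined by their dimension combination), with table names I made up for the example. A uniqueness constraint on the fact's dimension keys is one way to enforce the "no duplicate combinations" rule at load time:

```python
import sqlite3

# Star schema sketch: surrogate keys on dimensions, and a fact identified
# by its combination of dimension foreign keys. Names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, full_date TEXT UNIQUE);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT UNIQUE);
CREATE TABLE fact_sales (
    date_key  INTEGER REFERENCES dim_date(date_key),
    store_key INTEGER REFERENCES dim_store(store_key),
    sales_amount REAL,
    UNIQUE (date_key, store_key)  -- the dimension combination defines the fact
);
INSERT INTO dim_date  VALUES (20210101, '2021-01-01');
INSERT INTO dim_store VALUES (1, 'Downtown');
INSERT INTO fact_sales VALUES (20210101, 1, 150.0);
""")

# A duplicate dimension combination is rejected, keeping facts unambiguous.
try:
    con.execute("INSERT INTO fact_sales VALUES (20210101, 1, 99.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```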
Hierarchy in Dimensions
Consider the following two images.
An analyst will have an easy life if you set up the second dimension table for him/her.
So, with the second table, you have the following hierarchy:
There are all kinds of hierarchies — multiple hierarchies, single hierarchy, etc. I am not addressing them in this blog post.
One thing I would like to point out is that the time dimension is a real pain in the neck. You have to take care of the magic days, the fiscal calendar, the time zones, and the cycles (profit-over-quarter types). And don't be sloppy or over-confident here; even time-series databases won't help you with hierarchies if your ETL is messed up. You might want to take a look at outrigger dimensions. There are occasions in which one dimension is naturally dependent on another; in such a case, designers may put a foreign key from the one to the other. This is what constitutes an "outrigger dimension". In calendar dimensions, this is very common.
You cannot use a date in an outrigger at a different grain from the dates you use in the fact table, and you cannot allow aggregation over outrigger dimensions. If need be, mask numeric values in the outrigger with textual prefixes or suffixes to obstruct this.
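The masking trick is simple to sketch (the `FM-` fiscal-month prefix is an assumption of mine, not from any standard): once the value is text, an accidental sum fails loudly instead of producing a meaningless number.

```python
# Mask an outrigger numeric (here a hypothetical fiscal month) with a
# textual prefix so it can no longer be aggregated by accident.
fiscal_month = 3
masked = f"FM-{fiscal_month:02d}"
print(masked)  # FM-03

# Arithmetic on the masked value now raises instead of silently summing.
try:
    masked + 1  # type: ignore[operator]
    aggregatable = True
except TypeError:
    aggregatable = False
print(aggregatable)  # False
```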
Slowly Changing Dimensions
As much as I would like to write about it, I still think that it’s better for my readers to understand this concept thoroughly from here.
I am not discussing Snowflake dimensions but just to point out, they are still in use with OLAP databases.
Integrate your Big Data Into Your ETL System
You should treat your tabular Big Data as having been acquired through one of your standard Extract phases, and apply to it the same Transform steps:
- Data cleansing
- Conforming units and formats
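The two transform steps above can be sketched like this (record fields, the currency pair, and the exchange rate are all assumptions of mine for the example): drop unusable rows, then conform amounts to one currency and names to one format.

```python
# Cleanse raw records and conform units/formats before loading.
# Field names, currencies, and the rate below are illustrative assumptions.
raw = [
    {"store": " Downtown ", "amount": "19.99", "currency": "USD"},
    {"store": "airport",    "amount": "15,00", "currency": "EUR"},
    {"store": None,         "amount": "bad",   "currency": "USD"},  # unusable
]

USD_PER_EUR = 1.1  # hypothetical fixed rate, just for the sketch

def transform(rec):
    if not rec["store"]:
        return None                      # data cleansing: drop unusable rows
    try:
        value = float(rec["amount"].replace(",", "."))  # conform decimal format
    except ValueError:
        return None
    if rec["currency"] == "EUR":         # conform all amounts to USD
        value *= USD_PER_EUR
    return {"store": rec["store"].strip().title(),      # conform name format
            "amount_usd": round(value, 2)}

clean = [t for r in raw if (t := transform(r)) is not None]
print(clean)
```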
I wanted to understand the theoretical aspects of database design, which led me to read the book by Ross and Kimball. I then became curious to draw differences and analogies between their methods and those of today's leading data-driven companies like Netflix, Airbnb, Uber, etc.
In this quest, I can fairly say that the structured format of dimensional modeling is preferable to hardcore ETL alone, because it removes the dependency on you. Your BI team doesn't ring you up on Slack to create a new DAG for every other insight; instead, with correct modeling, you enable them to act and explore freely without needing you.