Data Warehouse Architecture


Data Warehouse Architecture: Traditional vs. Cloud

A data warehouse is an electronic system that gathers data from a wide range of sources within an organization and uses the data to support management decision-making.

Organizations are increasingly moving towards cloud-based data warehouses rather than traditional on-premise systems. Cloud-based data warehouses differ from traditional warehouses in the following ways:

There is no need to purchase physical hardware.

It’s quicker and cheaper to set up and scale cloud data warehouses.

Cloud-based data warehouse architectures can typically perform complex analytical queries much faster because they use massively parallel processing (MPP).

The rest of this article covers traditional data warehouse architecture and introduces some architectural ideas and concepts used by the most popular cloud-based data warehouse services.

For more details, see our page about data warehouse concepts in this guide.

Traditional Data Warehouse Architecture

The following concepts highlight some of the established ideas and design principles used for building traditional data warehouses.

Three-Tier Architecture

Traditional data warehouse architecture employs a three-tier structure composed of the following tiers.

Bottom tier: This tier contains the database server used to extract data from many different sources, such as transactional databases used for front-end applications.

Middle tier: The middle tier houses an OLAP server, which transforms the data into a structure better suited for analysis and complex querying. The OLAP server can work in two ways: either as an extended relational database management system that maps operations on multidimensional data to standard relational operations (Relational OLAP), or using a multidimensional OLAP model that directly implements the multidimensional data and operations.

Top tier: The top tier is the client layer. This tier holds the tools used for high-level data analysis, querying, reporting, and data mining.
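
To make the ROLAP/MOLAP distinction in the middle tier more concrete, here is a hypothetical sketch (all table and column names are invented) of how a ROLAP server might map a multidimensional request onto a standard relational query:

```python
# Hypothetical sketch: a ROLAP-style middle tier maps a multidimensional
# request (measure, dimensions, filters) onto ordinary relational SQL.
# Table and column names are illustrative only.

def rolap_query(measure, dimensions, filters):
    """Build a relational query equivalent to a multidimensional slice."""
    select_cols = ", ".join(dimensions + [f"SUM({measure}) AS total_{measure}"])
    where_clause = " AND ".join(f"{col} = '{val}'" for col, val in filters.items())
    group_by = ", ".join(dimensions)
    return (
        f"SELECT {select_cols} FROM sales_fact "
        f"JOIN date_dim USING (date_key) "
        f"WHERE {where_clause} GROUP BY {group_by}"
    )

# "Total sales by month for 2019" expressed as a cube slice:
print(rolap_query("sales_amount", ["month"], {"year": "2019"}))
```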


Kimball vs. Inmon

Two pioneers of data warehousing, Bill Inmon and Ralph Kimball, had different approaches to data warehouse design.

Ralph Kimball’s approach stressed the importance of data marts, which are repositories of data belonging to particular lines of business. The data warehouse is simply a combination of different data marts that facilitates reporting and analysis. The Kimball data warehouse design uses a “bottom-up” approach.

Bill Inmon regarded the data warehouse as the centralized repository for all enterprise data. In this approach, an organization first creates a normalized data warehouse model. Dimensional data marts are then created based on the warehouse model. This is known as a top-down approach to data warehousing.

Data Warehouse Models

In a traditional architecture there are three basic data warehouse models: virtual warehouse, data mart, and enterprise data warehouse:

  • A virtual data warehouse is a set of separate databases that can be queried together, so a user can effectively access all the data as though it were stored in one data warehouse.
  • A data mart model is used for business-line-specific reporting and analysis. In this data warehouse model, data is aggregated from a range of source systems relevant to a particular business area, such as sales or finance.
  • An enterprise data warehouse model prescribes that the data warehouse contain aggregated data that spans the entire organization. This model sees the data warehouse as the heart of the enterprise’s information system, with integrated data from all business units.

Star Schema vs. Snowflake Schema

The star schema and the snowflake schema are two ways to structure a data warehouse.

The star schema has a centralized data repository, stored in a fact table. The schema splits the fact table into a series of denormalized dimension tables. The fact table contains aggregated data to be used for reporting purposes, while the dimension tables describe the stored data.

Denormalized designs are less complex because the data is grouped. The fact table uses only one link to join to each dimension table. The star schema’s simpler design makes it much easier to write complex queries.
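
As a concrete sketch of a star schema, the toy example below joins an invented fact table to two denormalized dimension tables, each through a single key:

```python
import pandas as pd

# Hypothetical star schema: one fact table joined directly to each
# denormalized dimension table (names and data are illustrative only).
sales_fact = pd.DataFrame({
    "date_key": [1, 1, 2],
    "product_key": [10, 11, 10],
    "sales_amount": [100.0, 250.0, 75.0],
})
date_dim = pd.DataFrame({"date_key": [1, 2], "year": [2019, 2019], "month": ["Jan", "Feb"]})
product_dim = pd.DataFrame({"product_key": [10, 11], "product_name": ["Widget", "Gadget"]})

# Each dimension joins to the fact table through a single key -- the "one
# link" per dimension that keeps star-schema queries simple to write.
report = (
    sales_fact
    .merge(date_dim, on="date_key")
    .merge(product_dim, on="product_key")
    .groupby(["month", "product_name"])["sales_amount"].sum()
)
print(report)
```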


The snowflake schema is different because it normalizes the data. Normalization means efficiently organizing the data so that all data dependencies are defined and each table contains minimal redundancies. Single dimension tables thus branch out into separate dimension tables.

The snowflake schema uses less disk space and better preserves data integrity. The main drawback is the complexity of the queries required to access data: each query must dig deeper to reach the relevant data because there are multiple joins.
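
Continuing the same toy example, a snowflake version normalizes the product dimension into two tables, so the equivalent query needs an extra join:

```python
import pandas as pd

# Hypothetical snowflake variant: the product dimension is normalized into
# two tables, so reaching the category requires one more join.
sales_fact = pd.DataFrame({"product_key": [10, 11, 10], "sales_amount": [100.0, 250.0, 75.0]})
product_dim = pd.DataFrame({
    "product_key": [10, 11],
    "product_name": ["Widget", "Gadget"],
    "category_key": [1, 2],
})
category_dim = pd.DataFrame({"category_key": [1, 2], "category_name": ["Hardware", "Electronics"]})

# fact -> product -> category: each query digs one level deeper than in
# the star schema, in exchange for less redundancy in the dimension data.
report = (
    sales_fact
    .merge(product_dim, on="product_key")
    .merge(category_dim, on="category_key")
    .groupby("category_name")["sales_amount"].sum()
)
print(report)
```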


ETL vs. ELT

ETL and ELT are two different methods of loading data into a warehouse.

Extract, Transform, Load (ETL) first extracts the data from a pool of data sources, which are typically transactional databases. The data is held in a temporary staging database. Transformation operations are then performed to structure and convert the data into a suitable form for the target data warehouse system. The structured data is then loaded into the warehouse, ready for analysis.
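
A minimal, runnable sketch of this flow, using SQLite databases as stand-ins for the source, staging, and warehouse systems (all table names are invented):

```python
# Hypothetical ETL sketch using SQLite stand-ins -- all names are invented.
import sqlite3

source = sqlite3.connect(":memory:")     # source transactional database
staging = sqlite3.connect(":memory:")    # temporary staging database
warehouse = sqlite3.connect(":memory:")  # target data warehouse

# (Seed the pretend source so the sketch runs end to end.)
source.execute("CREATE TABLE orders (order_id, amount, order_date)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 100.0, "2019-01-01"), (2, 250.0, "2019-01-01")])

# 1. Extract: pull raw rows out of the source system into the staging area.
rows = source.execute("SELECT order_id, amount, order_date FROM orders").fetchall()
staging.execute("CREATE TABLE stg_orders (order_id, amount, order_date)")
staging.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)

# 2. Transform: reshape the staged data into a warehouse-friendly summary.
daily = staging.execute(
    "SELECT order_date, SUM(amount) FROM stg_orders GROUP BY order_date").fetchall()

# 3. Load: write the structured result into the warehouse fact table.
warehouse.execute("CREATE TABLE daily_sales (order_date, total_amount)")
warehouse.executemany("INSERT INTO daily_sales VALUES (?, ?)", daily)
```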


With Extract Load Transform (ELT), data is immediately loaded after being extracted from the source data pools. There is no staging database, meaning the data is immediately loaded into the single, centralized repository. The data is transformed inside the data warehouse system for use with business intelligence tools and analytics.
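
A corresponding ELT sketch under the same assumptions loads the raw rows straight into the warehouse and transforms them there:

```python
# Hypothetical ELT sketch -- the raw extract is loaded into the warehouse
# first, and the transformation runs inside the warehouse itself.
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Load: raw rows (as extracted from the invented source system) go straight in.
warehouse.execute("CREATE TABLE raw_orders (order_id, amount, order_date)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                      [(1, 100.0, "2019-01-01"), (2, 250.0, "2019-01-01")])

# Transform: reshaping happens inside the warehouse, typically via SQL issued
# by BI or analytics tools, rather than in a separate staging database.
warehouse.execute("""
    CREATE TABLE daily_sales AS
    SELECT order_date, SUM(amount) AS total_amount
    FROM raw_orders GROUP BY order_date
""")
```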

Organizational Maturity

The structure of an organization’s data warehouse also depends on its current situation and needs.

The basic structure lets end users of the warehouse directly access summary data derived from source systems and perform analysis, reporting, and mining on that data. This structure is useful when data sources derive from the same types of database systems.

A warehouse with a staging area is the next logical step in an organization with disparate data sources with many different types and formats of data. The staging area converts the data into a summarized structured format that is easier to query with analysis and reporting tools.

A variation on the staging structure is the addition of data marts to the data warehouse. The data marts store summarized data for a particular line of business, making that data easily available for specific forms of analysis. For example, adding data marts can allow a financial analyst to more easily perform detailed queries on sales data and make predictions about customer behavior. Data marts make analysis easier by tailoring data specifically to meet the needs of the end user.


New Data Warehouse Architectures

In recent years, data warehouses have been moving to the cloud. The new cloud-based data warehouses do not adhere to the traditional architecture; each data warehouse offering has a unique architecture.

This section summarizes the architectures used by two of the most popular cloud-based warehouses: Amazon Redshift and Google BigQuery.

Amazon Redshift

Amazon Redshift is a cloud-based representation of a traditional data warehouse.

Redshift requires computing resources to be provisioned and set up in the form of clusters, which contain a collection of one or more nodes. Each node has its own CPU, storage, and RAM. A leader node compiles queries and transfers them to compute nodes, which execute the queries.
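
To make the provisioning step concrete, the sketch below shows how a small cluster might be created with the AWS SDK for Python (boto3). The identifiers, credentials, and node choices are placeholders, not recommendations.

```python
# Sketch: provisioning a small Redshift cluster with boto3.
# Identifiers, credentials, and node choices are placeholders.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.create_cluster(
    ClusterIdentifier="example-warehouse",    # placeholder cluster name
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=2,                          # compute nodes; a leader node is provisioned automatically
    MasterUsername="admin",
    MasterUserPassword="replace-with-a-secret",
    DBName="analytics",
)
```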

On each node, data is stored in chunks, called slices. Redshift uses columnar storage, meaning each block of data contains values from a single column across a number of rows, instead of a single row with values from multiple columns.
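
The difference between row-oriented and columnar layout can be illustrated with a tiny in-memory example (the data is invented, and real columnar engines add compression and block metadata on top of this idea):

```python
# Toy illustration of row-oriented vs. columnar layout (invented data).
rows = [
    {"order_id": 1, "region": "EU", "amount": 100.0},
    {"order_id": 2, "region": "US", "amount": 250.0},
    {"order_id": 3, "region": "EU", "amount": 75.0},
]

# Row-oriented storage keeps whole records together:
row_store = rows

# Columnar storage keeps each column's values together, so an aggregate over
# one column only needs to read that column's block of data.
column_store = {
    "order_id": [r["order_id"] for r in rows],
    "region":   [r["region"] for r in rows],
    "amount":   [r["amount"] for r in rows],
}

total = sum(column_store["amount"])   # touches a single column, not every row
print(total)
```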

Google BigQuery

BigQuery’s architecture is serverless, meaning Google dynamically manages the allocation of machine resources. All resource management decisions are, therefore, hidden from the user.

BigQuery lets clients load data from Google Cloud Storage and other readable data sources. The alternative option is to stream data, which allows developers to add data to the data warehouse in real-time, row-by-row, as it becomes available.
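
As a rough sketch of these two ingestion paths, assuming the google-cloud-bigquery client library and placeholder project, dataset, table, and bucket names:

```python
# Sketch: batch loading vs. streaming with the google-cloud-bigquery client.
# The project, dataset, table, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.orders"   # placeholder table

# Batch load from Google Cloud Storage:
load_job = client.load_table_from_uri(
    "gs://my-bucket/orders.csv",
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
    ),
)
load_job.result()   # wait for the load job to finish

# Streaming insert: rows become available for querying almost immediately.
errors = client.insert_rows_json(table_id, [{"order_id": 4, "amount": 19.99}])
assert not errors
```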

BigQuery uses a query execution engine named Dremel, which can scan billions of rows of data in just a few seconds. Dremel uses massively parallel querying to scan data in the underlying Colossus file management system. Colossus distributes files into chunks of 64 megabytes among many computing resources named nodes, which are grouped into clusters.

Dremel uses a columnar data structure, similar to Redshift. A tree architecture dispatches queries among thousands of machines in seconds.

 

Simple SQL commands are used to perform queries on data.
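
For example, a sketch of such a query using the same client library (again with placeholder project and table names):

```python
# Sketch: querying BigQuery with plain SQL (placeholder names).
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT region, SUM(amount) AS total_amount
    FROM `my-project.analytics.orders`
    GROUP BY region
"""
for row in client.query(query).result():
    print(row["region"], row["total_amount"])
```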

Panoply

Panoply provides end-to-end data management-as-a-service. Its unique self-optimizing architecture uses machine learning and natural language processing (NLP) to model and streamline the data journey from source to analysis, reducing the time from data to value as much as possible.

Panoply’s smart data infrastructure includes the following features:

  • Analyzing queries and data – identifying the best configuration for each use case, adjusting it over time, and handling indexes, sort keys, dist keys, data types, vacuuming, and partitioning.
  • Identifying queries that do not follow best practices – such as those that include nested loops or implicit casting – and rewriting them to an equivalent query requiring a fraction of the runtime or resources.
  • Optimizing server configurations over time based on query patterns and by learning which server setup works best. The platform switches server types seamlessly and measures the resulting performance.

Beyond Cloud Data Warehouses

Cloud-based data warehouses are a big step forward from traditional architectures. However, users still face several challenges when setting them up:

  • Loading data to cloud data warehouses is non-trivial, and for large-scale data pipelines, it requires setting up, testing, and maintaining an ETL process. This part of the process is typically done with third-party tools.
  • Updates, upserts, and deletions can be tricky and must be done carefully to prevent degradation in query performance.
  • Semi-structured data is difficult to deal with – it needs to be normalized into a relational database format, which requires automation for large data streams.
  • Nested structures are typically not supported in cloud data warehouses. You will need to flatten nested tables into a format the data warehouse can understand (see the sketch after this list).
  • Optimizing your cluster—there are different options for setting up a Redshift cluster to run your workloads. Different workloads, data sets, or even different types of queries might require a different setup. To stay optimal you’ll need to continually revisit and tweak your setup.
  • Query optimization—user queries may not follow best practices, and consequently will take much longer to run. You may find yourselves working with users or automated client applications to optimize queries so that the data warehouse can perform as expected.
  • Backup and recovery—while the data warehouse vendors provide numerous options for backing up your data, they are not trivial to set up and require monitoring and close attention.
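
As an illustration of the flattening mentioned above, here is a minimal sketch that turns one invented nested order record into flat rows a relational warehouse could ingest:

```python
# Sketch: flattening a nested record into flat rows for a relational warehouse.
# The record shape and field names are invented for illustration.
nested_order = {
    "order_id": 1,
    "customer": {"id": 42, "country": "DE"},
    "items": [
        {"sku": "A-100", "quantity": 2, "price": 9.99},
        {"sku": "B-200", "quantity": 1, "price": 24.50},
    ],
}

def flatten(order):
    """Produce one flat row per nested line item."""
    return [
        {
            "order_id": order["order_id"],
            "customer_id": order["customer"]["id"],
            "customer_country": order["customer"]["country"],
            "item_sku": item["sku"],
            "item_quantity": item["quantity"],
            "item_price": item["price"],
        }
        for item in order["items"]
    ]

for row in flatten(nested_order):
    print(row)
```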

Panoply is a Smart Data Warehouse that adds a layer of automation that takes care of all of the complex tasks above, saving valuable time and helping you get from data to insight in minutes.