Data Modeling

data-modeling

Data Modeling ?

The goals of this article are to overview fundamental data modeling skills that all developers should have, skills that can be applied on both traditional projects that take a serial approach to agile projects that take an evolutionary approach. My personal philosophy is that every IT professional should have a basic understanding of data modeling. They don’t should be experts at data modeling, yet they should be prepared to be involved in the creation of such a model, be able to read an existing data model, understand when and when not to create a data model, and appreciate fundamental data design techniques. This article is a short introduction to these skills. The primary audience for this article is application developers who need to gain an understanding of some of the critical activities performed by an Agile DBA. This understanding should lead to an appreciation of what Agile DBAs do and why they do them, and it should assist with bridging the communication gap between these two roles.

able of Contents

  1. What is data modeling?
    • How are data models used in practice?
    • What about conceptual models?
    • Common data modeling notations
  2. How to model data
    • Identify entity types
    • Identify attributes
    • Apply naming conventions
    • Identify relationships
    • Apply data model patterns
    • Assign keys
    • Normalize to reduce data redundancy
    • Denormalize to improve performance
  3. Evolutionary/agile data modeling
  4. How to become better at modeling data

1. What is Data Modeling?

Data modeling is the act of exploring data-arranged structures. Like other modeling artifacts data models can be used for a variety of purposes, from elevated level conceptual models to physical data models. From the point of view of an article arranged developer data modeling is conceptually similar to class modeling. With data modeling you distinguish entity types whereas with class modeling you recognize classes. Data attributes are assigned to entity types just as you would assign attributes and operations to classes. There are associations between entities, similar to the associations between classes – relationships, inheritance, composition, and aggregation are all applicable concepts in data modeling.

Traditional data modeling is unique in relation to class modeling because it focuses solely on data – class models allow you to investigate both the behavior and data aspects of your domain, with a data model you can just investigate data issues. Because of this focus data modelers have an inclination to be vastly improved at getting the data “right” than object modelers. However, some individuals will display database methods (stored procedures, stored functions, and triggers) when they are physical data modeling. It depends on the situation of course, however I personally think that this is a smart thought and advance the idea in my UML data modeling profile (more on this later).

Although the focus of this article is data modeling, there are often alternatives to data-situated artifacts (never forget Agile Modeling’s Multiple Models principle). For example, with regards to conceptual modeling ORM diagrams aren’t your lone alternative – In addition to LDMs it is quite regular for individuals to create UML class diagrams and even Class Responsibility Collaborator (CRC) cards instead. In fact, my experience is that CRC cards are superior to ORM diagrams because it is very easy to get venture stakeholders actively involved in the creation of the model. Instead of a traditional, analyst-drove drawing session you can instead facilitate stakeholders through the creation of CRC cards.

1.1 How are Data Models Used in Practice?

Although methodology issues are covered later, we need to discuss how data models can be used in practice to better understand them. You are likely to see three basic styles of data model:

  • Conceptual data models. These models, sometimes called domain models, are typically used to explore domain concepts with project stakeholders. On Agile teams high-level conceptual models are often created as part of your initial requirements envisioning efforts as they are used to explore the high-level static business structures and concepts. On traditional teams conceptual data models are often created as the precursor to LDMs or as alternatives to LDMs.
  • Logical data models (LDMs). LDMs are used to explore the domain concepts, and their relationships, of your problem domain. This could be done for the scope of a single project or for your entire enterprise. LDMs depict the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities. LDMs are rarely used on Agile projects although often are on traditional projects (where they rarely seem to add much value in practice).
  • Physical data models (PDMs). PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. PDMs often prove to be useful on both Agile and traditional projects and as a result the focus of this article is on physical modeling.

Although LDMs and PDMs sound very similar, and they in fact are, the level of detail that they model can be significantly unique. This is because the goals for each diagram is unique – you can use a LDM to investigate domain concepts with your stakeholders and the PDM to define your database design. Figure 1 presents a simple LDM and Figure 2 a simple PDM, both modeling the idea of customers and addresses as well as the relationship between them. The two diagrams apply the Barker notation, summarized below. Notice how the PDM shows greater detail, including an associative table needed to execute the association as well as the keys expected to maintain the relationships. More on these concepts later. PDMs should also mirror your organization’s database naming standards, in this case an abbreviation of the entity name is appended to each segment name and an abbreviation for “Number” was consistently introduced. A PDM should also indicate the data types for the columns, such as integer and char(5). Although Figure 2 does not show them, query tables (also called reference tables or description tables) for how the address is used as well as for states and countries are inferred by the attributes ADDR_USAGE_CODE, STATE_CODE, and COUNTRY_CODE.

A simple logical data model.Simple LDMFigure 2. A simple physical data model.Simple PDMAn important observation about Figures 1 and 2 is that I’m not slavishly following Barker’s approach to naming relationships. For example, between Customer and Address there really should be two names “Each CUSTOMER may be located in one or more ADDRESSES” and “Each ADDRESS may be the site of one or more CUSTOMERS”. Although these names explicitly define the relationship I personally think that they’re visual noise that clutter the diagram. I prefer simple names such as “has” and then trust my readers to interpret the name in each direction. I’ll only add more information where it’s needed, in this case I think that it isn’t. However, a significant advantage of describing the names the way that Barker suggests is that it’s a good test to see if you actually understand the relationship – if you can’t name it then you likely don’t understand it.

Data models can be used effectively at both the enterprise level and on projects. Enterprise architects will often create at least one elevated level LDMs that portray the data structures that support your enterprise, models typically alluded to as enterprise data models or enterprise information models. An enterprise data model is one of several views that your organization’s enterprise architects may choose to maintain and support – other views may investigate your organization/hardware infrastructure, your organization structure, your software infrastructure, and your business processes (to name a couple). Enterprise data models provide information that a task team can use both as a set of constraints as well as important insights into the structure of their system.

Task teams will typically create LDMs as a primary analysis artifact when their implementation environment is predominantly procedural in nature, for example they are using structured COBOL as an implementation language. LDMs are also a decent decision when an undertaking is data-situated in nature, perhaps a data warehouse or reporting system is being developed (having said that, experience seems to show that usage-focused approaches appear to work even better). However LDMs are often a helpless decision when a task team is using object-arranged or part based technologies because the developers would rather work with UML diagrams or when the venture is not data-situated in nature. As Agile Modeling advises, apply the privilege artifact(s) for the work. Or then again, as your grandfather probably advised you, use the correct tool for the work. It’s important to take note of that traditional approaches to Master Data Management (MDM) will often motivate the creation and maintenance of detailed LDMs, an effort that is rarely justifiable in practice when you consider the total cost of ownership (TCO) while calculating the rate of profitability (ROI) of those sorts of efforts.

At the point when a relational database is used for data storage venture teams are best advised to create a PDMs to demonstrate its internal schema. My experience is that a PDM is often one of the critical design artifacts for business application development projects.

2.2. What About Conceptual Models?

Halpin (2001) points out that many data professionals prefer to create an Object-Role Model (ORM), an example is depicted in Figure 3, instead of an LDM for a conceptual model. The advantage is that the notation is very simple, something your project stakeholders can quickly grasp, although the disadvantage is that the models become large very quickly. ORMs enable you to first explore actual data examples instead of simply jumping to a potentially incorrect abstraction – for example Figure 3 examines the relationship between customers and addresses in detail.

Figure 3. A simple Object-Role Model.

My experience is that people will capture information in the best place that they know. As a result I typically discard ORMs after I’m finished with them. I sometimes user ORMs to explore the domain with project stakeholders but later replace them with a more traditional artifact such as an LDM, a class diagram, or even a PDM. As a generalizing specialist, someone with one or more specialties who also strives to gain general skills and knowledge, this is an easy decision for me to make; I know that this information that I’ve just “discarded” will be captured in another artifact – a model, the tests, or even the code – that I understand. A specialist who only understands a limited number of artifacts and therefore “hands-off” their work to other specialists doesn’t have this as an option. Not only are they tempted to keep the artifacts that they create but also to invest even more time to enhance the artifacts. Generalizing specialists are more likely than specialists to travel light.

2.3. Common Data Modeling Notations

Figure 4 presents a summary of the syntax of four common data modeling notations: Information Engineering (IE), Barker, IDEF1X, and the Unified Modeling Language (UML). This diagram isn’t meant to be comprehensive, instead its goal is to provide a basic overview. Furthermore, for the sake of brevity I wasn’t able to depict the highly-detailed approach to relationship naming that Barker suggests. Although I provide a brief description of each notation in Table 1 I highly suggest David Hay’s paper A Comparison of Data Modeling Techniques as he goes into greater detail than I do.

Figure 4. Comparing the syntax of common data modeling notations.Data modeling notation summaryTable 1. Discussing common data modeling notations.

Notation

Comments

IEThe IE notation (Finkelstein 1989) is simple and easy to read, and is well suited for high-level logical and enterprise data modeling. The only drawback of this notation, arguably an advantage, is that it does not support the identification of attributes of an entity. The assumption is that the attributes will be modeled with another diagram or simply described in the supporting documentation.
BarkerThe Barker notation is one of the more popular ones, it is supported by Oracle’s toolset, and is well suited for all types of data models. It’s approach to subtyping can become clunky with hierarchies that go several levels deep.
IDEF1XThis notation is overly complex. It was originally intended for physical modeling but has been misapplied for logical modeling as well. Although popular within some U.S. government agencies, particularly the Department of Defense (DoD), this notation has been all but abandoned by everyone else. Avoid it if you can.
UMLThis is not an official data modeling notation (yet). Although several suggestions for a data modeling profile for the UML exist, none are complete and more importantly are not “official” UML yet. However, the Object Management Group (OMG) in December 2005 announced an RFP for data-oriented models.

3. How to Model Data

It is critical for an application developer to have a grasp of the fundamentals of data modeling so they can not only read data models but also work effectively with Agile DBAs who are responsible for the data-oriented aspects of your project. Your goal reading this section is not to learn how to become a data modeler, instead it is simply to gain an appreciation of what is involved.

The following tasks are performed in an iterative manner:

  • Identify entity types
  • Identify attributes
  • Apply naming conventions
  • Identify relationships
  • Apply data model patterns
  • Assign keys
  • Normalize to reduce data redundancy
  • Denormalize to improve performance

3.1 Identify Entity Types

An entity type, also simply called entity (not exactly accurate terminology, but very common in practice), is similar conceptually to object-orientation’s concept of a class – an entity type represents a collection of similar objects. An entity type could represent a collection of people, places, things, events, or concepts. Examples of entities in an order entry system would include CustomerAddressOrderItem, and Tax. If you were class modeling you would expect to discover classes with the exact same names. However, the difference between a class and an entity type is that classes have both data and behavior whereas entity types just have data.Ideally an entity should be normal, the data modeling world’s version of cohesive. A normal entity depicts one concept, just like a cohesive class models one concept. For example, customer and order are clearly two different concepts; therefore it makes sense to model them as separate entities.

3.2 Identify Attributes

Each entity type will have one or more data attributes. For example, in Figure 1 you saw that the Customer entity has attributes such as First Name and Surname and in Figure 2 that the TCUSTOMER table had corresponding data columns CUST_FIRST_NAME and CUST_SURNAME (a column is the implementation of a data attribute within a relational database).Attributes should also be cohesive from the point of view of your domain, something that is often a judgment call. – in Figure 1 we decided that we wanted to model the fact that people had both first and last names instead of just a name (e.g. “Scott” and “Ambler” vs. “Scott Ambler”) whereas we did not distinguish between the sections of an American zip code (e.g. 90210-1234-5678). Getting the level of detail right can have a significant impact on your development and maintenance efforts. Refactoring a single data column into several columns can be difficult, database refactoring is described in detail in Database Refactoring, although over-specifying an attribute (e.g. having three attributes for zip code when you only needed one) can result in overbuilding your system and hence you incur greater development and maintenance costs than you actually needed.

3.3 Apply Data Naming Conventions

Your organization should have standards and guidelines applicable to data modeling, something you should be able to obtain from your enterprise administrators (if they don’t exist you should lobby to have some put in place). These guidelines should include naming conventions for both logical and physical modeling, the logical naming conventions should be focused on human readability whereas the physical naming conventions will reflect technical considerations. You can clearly see that different naming conventions were applied in Figures 1 and 2.As you saw in Introduction to Agile Modeling, AM includes the Apply Modeling Standards practice. The basic idea is that developers should agree to and follow a common set of modeling standards on a software project. Just like there is value in following common coding conventions, clean code that follows your chosen coding guidelines is easier to understand and evolve than code that doesn’t, there is similar value in following common modeling conventions.

3.4 Identify Relationships

In the real world entities have relationships with other entities. For example, customers PLACE orders, customers LIVE AT addresses, and line items ARE PART OF orders. Place, live at, and are part of are all terms that define relationships between entities. The relationships between entities are conceptually identical to the relationships (associations) between objects.Figure 5 depicts a partial LDM for an online ordering system. The first thing to notice is the various styles applied to relationship names and roles – different relationships require different approaches. For example the relationship between Customer and Order has two names, places and is placed by, whereas the relationship between Customer and Address has one. In this example having a second name on the relationship, the idea being that you want to specify how to read the relationship in each direction, is redundant – you’re better off to find a clear wording for a single relationship name, decreasing the clutter on your diagram. Similarly you will often find that by specifying the roles that an entity plays in a relationship will often negate the need to give the relationship a name (although some CASE tools may inadvertently force you to do this). For example the role of billing address and the label billed to are clearly redundant, you really only need one. For example the role part of that Line Item has in its relationship with Order is sufficiently obvious without a relationship name.Figure 5. A logical data model (Information Engineering notation).