Bringing the Development World under the Data Management Umbrella

We have Data at Rest, Data in Transit, and Data in Use. Data in Use has been covered pretty well by Data Architecture and Data Governance teams. Data Architects design databases, which are one of the main storage mechanisms, and tools like ER/Studio can categorise them against business glossaries and feed the models into Data Governance tools like Collibra. But Data in Transit and Data in Process, the data used within applications, is often overlooked. Often we have a large team of developers building applications and application messaging systems such as Enterprise Server Bus or Middleware layers outside of the core data architecture. Wouldn't it be great to have a common data ecosystem with Data Governance, Data Architecture, Data Analytics, Application Integration, and Application Designers all working together?

So what do we need to do? 

In order to manage and govern our data, we need a model of three parts

  1. A Business Glossary
    1. List of Business Terms describing pieces of information important to the business, Each Term will have an accepted definition, useful classifications like sensitivity, is it personal data, and importance, along with any policies and rules around this data. This model is usually built and managed by business folks including Subject Matter Experts and business owners.
  2. An Enterprise Data Model
    1. A structured model of the important and common information entities of the organization perhaps with a high level Conceptual Model and a more detailed Logical Model that is structured to Third Normal Form and can be used to design data assets. A model of the information of the organization, independent of implementation, that can be used to design and document data assets.
  3. A Technical Data Dictionary
    1. A catalog of data assets with a model containing the technical metadata representing the schema or structure of these assets such as Table and Column, or Field names and datatypes. These are closely aligned to the implementation of the data in the technology of the asset.

The picture above shows all of these models joined together. This allows us to navigate the model to help:

  1. Understand what a piece of data in a data asset means and what rules apply to it
  2. Understand in which data assets a piece of information is stored, transported, or used
  3. Allow us to design data assets quickly and using standardized structures and formats
  4. Minimize the risk to the organization by ensuring that sensitive, personal and valuable data is protected and used appropriately
  5. Gives us quicker access to the right data when we want to ask questions, or use data

Seems simple doesn’t it? Data Stewards in the Data Governance team will already have a Business Glossary and a catalog of data assets, although usually databases. Data Architects usually already have an Enterprise Data Model and models of data assets stored in databases; Data At Rest.  But what about Data in Transit and Data in Process?

One of the challenges is the data assets in Data in Transit and Data in Process are based on JavaScript Object Notation (JSON), a hierarchical text-based data format that are usually denormalized for performance. The models used by data architects to design the information of the organization are in relational highly normalized models for efficiency in storage. If only we could convert one from the other and easily connect them.

ER/Studio Data Architect now allows you to use a standard logical data model and, with a little extra markup, repeatably generate hierarchical data structures such as those in JSON, MongoDB and Google BigQuery that includes denormalization patterns.

Likewise, when reverse engineering a hierarchical data structure can generate a relational logical data model identifying repeated common entities.

Consider the following logical data model diagram.

We have books containing containing chapters containing sections. We have libraries and patrons that can checkout those books. Publishers, libraries and patrons all have an address.

This is a typical logical data model expressed as an Entity Relation Diagram expressed in Third Normal Form (3NF). From this we can generate a physical model for most popular database products in ER/Studio. 

In a hierarchical model such as MongoDB, Google BigQuery or JSON we can nest objects inside others. For instance we could have a JSON array of book objects and each one contains a field containing an array of chapters which in turn contain an array of sections.

We may also want to consider denormalization for performance. We may want Google BigQuery tables or MongoDB collections of publishers, libraries, and patrons. In each of these we may want potentially repeated data for its address. We may want to decide whether the checkout exists as a substructure of a book, checkout, or library or maybe all three.

In ER/Studio we have a new feature where we can add to relationship lines whether the relationship will be ‘containing’ and the direction of containment. For instance, the identifying relationship between book and chapter will be containing with book containing chapter. But the non-identifying relationship between publisher and address will also be containing but in the opposite direction to the relationship. 

When we generate the physical model we utilise these containment properties to decide how we create the hierarchical model, which objects are contained in which and the denormalization. The model above will repeatably generate a MongoDB physical model in ER/Studio as follows:

If we want different containment for our plain JSON model we could change the containment flags and regenerate or use Compare/merge to propagate the differences. Or just add new contaminant lines direct into the physical model, and maybe update the logical later again using Compare/merge.

As usual once you have designed your physical data model you can generate the schema for your data asset. For JSON you can output an JSON instance file or a JSON Schema file.

So from a single Logical Data Model you can generate a relational database and/or a hierarchical structure supporting polyglot persistence.

The initial logical data model could be a fragment of the Enterprise Logical Data model acting as a canonical model for your JSON messaging infrastructure. Like with any model you can tie this to Business Terms and publish to your Data Governance tool.

ER/Studio makes the creation of hierarchical structures so much easier, quicker and traceable back to the enterprise data architecture. Your development team can leverage the knowledge base of the data architecture team and rapidly produced well documented structures within a Governance framework.