Big Data = Bigger Need for Data Models – by Rob Loranger

Nov 26, 2014

Big Data has had a great impact on science, healthcare, government and many other industries. For instance, it is now possible for scientists to map and analyze the human genome, for healthcare organizations to provide continuous monitoring of patient health, and for financial institutions to greatly improve their ability to detect fraud. In addition, as of 2012, Big Data has been used to help win presidential elections.
As described in the article Barack Obama’s Big Data won the US election, President Obama’s campaign team was able to micro-target potential voters by analyzing massive amounts of personal information captured from campaign websites and social media platforms. This allowed the team to personalize the campaign at a level of granularity never achieved before. In this way, Big Data not only helped Barack Obama win the 2012 US presidential election, but it has likely changed forever the way presidential campaigns are conducted.
Beyond the common thread of Big Data, examples like these often use Apache Hadoop as the underlying data management platform. Thanks to Hadoop’s schema-less architecture and its ability to scale horizontally over commodity hardware, data storage is no longer the limiting factor for performing in-depth analytics. Hadoop gives data professionals the ability to keep up with the volume, velocity, and variety of data being created. However, a schema-less architecture does not mean that data modeling can be avoided; the schema is simply applied at read time rather than at load time.
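To illustrate schema-on-read, here is a minimal HiveQL sketch (the table name, columns, and HDFS path are hypothetical): the raw files land in HDFS with no upfront structure, and a table definition is layered on top afterward, so the design work still has to happen before anyone can query the data.

```sql
-- Hypothetical example: raw campaign-site logs already sitting in HDFS.
-- Hive applies this schema at read time (schema-on-read); the files were
-- loaded with no upfront structure, but querying still requires a design.
CREATE EXTERNAL TABLE raw_site_visits (
  visit_ts   STRING,
  ip_address STRING,
  page_url   STRING,
  referrer   STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/site_visits';
```

The modeling effort has not disappeared; it has simply moved from load time to the point where the table is defined.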
Since data stored in Hadoop is almost always used for analytics, and because of its immense volume and wide variety, the need for data modeling has never been greater. For example, if a micro-targeting strategy relies on personal data captured from a campaign website, a dimensional data model should be used to organize the data into facts, such as website visits by potential supporters, and dimensions, such as the location and time of those visits. Without that structure, the data cannot be analyzed properly.
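As a rough sketch of what such a dimensional model could look like in Hive (all table and column names here are illustrative, not taken from the campaign), a fact table records each visit and joins to date and location dimensions:

```sql
-- Illustrative star schema for micro-targeting analytics.
-- Dimension: calendar attributes of the visit.
CREATE TABLE date_dim (
  date_key     INT,      -- surrogate key, e.g. 20121106
  calendar_dt  STRING,
  day_of_week  STRING,
  month_name   STRING
);

-- Dimension: where the visitor appears to be located.
CREATE TABLE location_dim (
  location_key INT,      -- surrogate key
  city         STRING,
  state        STRING,
  country      STRING
);

-- Fact: one row per website visit by a potential supporter.
-- Hive does not enforce foreign keys, so they are noted as comments.
CREATE TABLE site_visit_fact (
  date_key     INT,      -- joins to date_dim
  location_key INT,      -- joins to location_dim
  visitor_id   BIGINT,
  page_views   INT,
  visit_secs   INT
);
```

With tables like these in place, analysts can slice visit counts by any combination of dimension attributes using ordinary joins and GROUP BY.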
Furthermore, because of the immense volume and wide variety of Big Data, the need for quality metadata is at an all-time high. Metadata is critical for documenting important information such as the definitions, security policies, and stewardship of the data in Hadoop. For instance, if micro-targeting relies on IP addresses to determine potential voter locations, the metadata will not only point out which field contains IP addresses, but can also indicate that IP addresses combined with other data may expose personally identifiable information (PII), which must be kept confidential. Moreover, if stewardship is documented in the metadata, an analyst can contact the data steward to learn more about the campaign's policy on PII. Metadata can be stored in different ways, but a data model is often the best place to keep it.
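Some of this metadata can even travel with the tables themselves. Here is a minimal HiveQL sketch (the names and policy values are hypothetical) using column COMMENTs and TBLPROPERTIES to record a field definition, a PII caution, and the data steward:

```sql
-- Hypothetical example of embedding metadata in the Hive metastore.
CREATE TABLE site_visit_raw (
  ip_address STRING COMMENT 'Visitor IP; combined with other fields can expose PII; keep confidential',
  visit_ts   STRING COMMENT 'Visit timestamp, UTC, ISO-8601'
)
TBLPROPERTIES (
  'data.steward'  = 'campaign-data-team@example.org',
  'pii.policy'    = 'restricted',
  'source.system' = 'campaign website clickstream'
);
```

Running DESCRIBE FORMATTED on the table then surfaces these comments and properties to any analyst browsing the metastore.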
{"video":"http://youtu.be/SQSiPwbdTlA","width":"undefined","height":"undefined"}
For more tips on handling unstructured data in your models, listen as Karen Lopez shares challenges and insights on data modeling for big data and NoSQL technologies, available on demand.
Want to learn more about ER/Studio? Try it for yourself free for 14 days!

About the author:
Rob Loranger is an Embarcadero Product Manager for the ER/Studio product family. Prior to his current role, Rob was a Senior Software Consultant, and for more than eight years he has been one of Embarcadero's leading experts in its database development, management, and architecture software.