Understanding the Role of Governance in Data Lakes and Warehouses

Originally published on Information-Management.com
Written by Annette Wright, Senior Director of Analytics and Governance

Data lakes and data warehouses are both used to store data. They have innate differences and serve organizations differently, but one common thread runs through both, and without it either would be rendered useless: data governance.

Data lakes are repositories of data that can be structured or unstructured. They can contain traditional transaction-type data, phone logs, you name it. A data lake is truly a repository of all types of organizational data.

With data lakes, data can be brought in quickly, without complex provisioning, and no time is spent up front on how it relates to, or should interact with, other data sitting in the lake. Data should be kept as close to its raw form as possible so that it can be used in multiple functions and isn’t locked into a particular use. Because all of the data is available, the lake allows for much deeper analytics.

Data lakes allow more flexibility for what-if analysis and modeling to identify relationships and likely outcomes that may not otherwise be obvious, as with market basket analysis. When data scientists can quickly access more information to identify such obscure relationships, companies can use that information to better serve customers.
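As a rough illustration of market basket analysis, the sketch below computes the lift of every pair of items that appear together in a purchase. The baskets and the pair_lift function are hypothetical and exist only for this example. A lift above 1 suggests two items are bought together more often than chance would predict; a lift below 1 suggests the opposite.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk", "eggs"},
]

def pair_lift(baskets):
    """Return {(item_a, item_b): lift} for every pair that co-occurs."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each pair (items sorted so keys are stable).
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    lifts = {}
    for (a, b), count in pair_counts.items():
        support_ab = count / n
        support_a = item_counts[a] / n
        support_b = item_counts[b] / n
        # lift = P(a and b) / (P(a) * P(b))
        lifts[(a, b)] = support_ab / (support_a * support_b)
    return lifts

lifts = pair_lift(baskets)
```

With real data the same calculation would run over millions of raw transactions, which is where quick access to everything in the lake matters.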

At the same time, data lakes allow for the identification of negative indicators, which can help protect the business by surfacing risks early on so they can be mitigated.

A good example of this comes from a regulatory perspective. One key regulatory reporting metric is probability of default, for which models are built to calculate the probability of default for different classifications of customers (whether based on geographic location, credit limit, etc.).

Because a wide range of factors is used in these models, data lake analytics can provide access to more data more quickly, which can greatly increase the accuracy of the models. This in turn allows organizations to better serve their clients and gives them insight into possible risks early on so that they can be mitigated.
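To make the idea of a per-classification probability of default concrete, here is a deliberately simple sketch. It only computes the observed default rate for each customer segment. Production models are statistical and use many more factors, so this is a starting point under stated assumptions, not a regulatory method. The loan records, segment names, and function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical loan records: (customer segment, defaulted within the horizon?)
loans = [
    ("northeast", False), ("northeast", True),
    ("northeast", False), ("northeast", False),
    ("southwest", True), ("southwest", True),
    ("southwest", False), ("southwest", False),
]

def default_rate_by_segment(loans):
    """Return the observed share of defaulted loans for each segment."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for segment, defaulted in loans:
        totals[segment] += 1
        defaults[segment] += int(defaulted)
    return {segment: defaults[segment] / totals[segment] for segment in totals}

rates = default_rate_by_segment(loans)
```

Segmenting by credit limit or any other classification would only change the first element of each record.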

Data warehouses

Data warehouses are structured data sets that include both current and historical data. They are structured to meet reporting or analytical requirements. Creating a “single source of truth” for multiple reporting and analytical requirements reduces the risk of inconsistent and inaccurate reporting across the enterprise.

Data warehouses bring data together in a structured way: the data is modeled and set up in physical structures according to a set of requirements, with performance and the capture of consistent data relationships being the key goals.

Data warehouses are used to consolidate data sources, allowing everything to flow into the same tables through a common set of domains and definitions. Whether there is one source or 20, all of the data is presented for use under a set of business-defined and understood domains for organizational purposes.
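One way to picture consolidation under common definitions is a mapping from each source system's field names to a shared, business-defined set of names. The sketch below assumes two hypothetical sources, "crm" and "billing", with invented field names. Real warehouses typically do this in ETL tooling or SQL rather than in application code.

```python
# Hypothetical mappings from each source's field names to the common names.
FIELD_MAPS = {
    "crm": {"cust_id": "customer_id", "rgn": "region", "amt": "amount"},
    "billing": {
        "CustomerNumber": "customer_id",
        "Region": "region",
        "InvoiceTotal": "amount",
    },
}

def to_common_row(source, record):
    """Rename a source record's fields to the shared, business-defined names."""
    mapping = FIELD_MAPS[source]
    missing = [name for name in mapping if name not in record]
    if missing:
        raise KeyError(f"{source} record is missing fields: {missing}")
    row = {common: record[src] for src, common in mapping.items()}
    # Keep track of where the row came from.
    row["source_system"] = source
    return row

rows = [
    to_common_row("crm", {"cust_id": 17, "rgn": "EMEA", "amt": 120.0}),
    to_common_row(
        "billing",
        {"CustomerNumber": 17, "Region": "EMEA", "InvoiceTotal": 80.0},
    ),
]
```

Both rows now share the same columns, so they can land in the same table and be reported on together.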

Having data well organized and consistently aggregated allows for the creation of performance and operational metrics – reporting that drives business and allows leaders to make informed decisions. Inclusion of both historical and current information organized in a consistent manner within the data warehouse increases the quality of the viewed data, thus increasing decision-making quality.

A key example of this can be seen in seasonality. Operational metrics pulled from data warehouses can help identify times of the year that see more activity than others, such as holidays.

This historical analysis can guide staffing needs and what information is given to merchants, and it can indicate when customers should be told that a higher-activity period is coming. It can also affect IT decisions: new systems shouldn’t be implemented in the middle of a holiday rush. The metrics identified from data warehouse information can impact decisions across the entire organization.
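The seasonality check described here can be sketched in a few lines. The example assumes hypothetical daily transaction counts covering two years, and flags any month whose average activity is well above the overall monthly average. The 1.5x threshold is an arbitrary choice for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical historical activity: (day, number of transactions).
daily_counts = [
    (date(2022, 3, 14), 100), (date(2022, 7, 4), 110), (date(2022, 12, 20), 300),
    (date(2023, 3, 14), 120), (date(2023, 7, 4), 130), (date(2023, 12, 20), 340),
]

def peak_months(daily_counts, threshold=1.5):
    """Return months whose average activity exceeds threshold x the overall average."""
    by_month = defaultdict(list)
    for day, count in daily_counts:
        # Group across years so each calendar month pools its history.
        by_month[day.month].append(count)
    monthly_avg = {month: sum(v) / len(v) for month, v in by_month.items()}
    overall = sum(monthly_avg.values()) / len(monthly_avg)
    return sorted(m for m, avg in monthly_avg.items() if avg > threshold * overall)

peaks = peak_months(daily_counts)
```

A result like this is only trustworthy because the warehouse holds both current and historical data organized in a consistent manner.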

Data governance – The common thread between the data stores

Although they are different, the key to successful data lakes and data warehouses with useful, quality data is the same: governance. Data governance provides an understanding not only of what is stored where and where it came from, but also of the relative quality of the data, along with the ability to ascertain that quality consistently.

Aside from clarity and structure, governance also allows control. With such control, the organization knows how the data is being used and whether or not it’s meeting its intended purpose.

Say the data has been manipulated to meet a set of determined requirements. Without data governance, someone else could come along and pull the data, not knowing it had already been altered, and end up with an inaccurate analysis.

Essentially, governance is the key to maintaining transparency over what data is available, how data is available, what data should be used, and who should or should not be using it. It serves as the glue ensuring both data stores are being utilized appropriately.
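In practice, much of this transparency comes down to metadata kept alongside each data set. The sketch below is a minimal, hypothetical catalog entry, not any particular governance product's model. It records the source, owner, a quality score, who may use the data, and what has already been done to it, so the next analyst can tell whether a data set is still raw. All names and values are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Hypothetical governance record for one data set."""
    name: str
    source: str
    owner: str
    quality_score: float  # 0.0 to 1.0, however the organization measures it
    allowed_roles: set = field(default_factory=set)
    transformations: list = field(default_factory=list)  # lineage notes

    def record_transformation(self, note):
        """Log a change so later users know the data is no longer raw."""
        self.transformations.append(note)

    def can_use(self, role):
        """Answer 'who should or should not be using it'."""
        return role in self.allowed_roles

    def is_raw(self):
        """True only if no transformation has been recorded."""
        return not self.transformations

entry = CatalogEntry(
    name="card_transactions",
    source="payments_gateway",
    owner="risk_analytics",
    quality_score=0.92,
    allowed_roles={"risk_analyst", "data_scientist"},
)
entry.record_transformation("filtered to accounts opened after 2020")
```

An analyst who checks is_raw() before pulling card_transactions would see that the data has already been filtered, which is exactly the situation ungoverned data hides.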

Whether a company employs a data lake, a data warehouse or both, it’s imperative that the data is governed appropriately. Both data stores provide beneficial insights that can help lead an organization, affecting everything from consumers to the bottom line. But without a data governance framework to control and guide the two, the wealth of data they hold may never live up to its transformative potential.