By Cary Moore, Senior Director of Data Science


For the last decade (at least), much of what I’ve seen and read in the marketplace, research, literature, and press suggests that the data lake, data scientists, simplified end-user report environments, and the cloud will inevitably replace all that was wrong with the Enterprise Data Warehouse (EDW) – as if it were some evil imperial monolith from the “Dark Side” to be destroyed.

The reasons: over budget, over-complicated, often late, missed requirements, poor stakeholder engagement, and every other criticism you can think of for failed missions. Of course, many of these circumstances were true. Every EDW I’ve ever built has had its challenges – just like any other project – but, fortunately, most were successful. Invariably, a few proved more challenging than others, and customer satisfaction waned during the engagement. “Hindsight is 20/20” – if you had only known what you needed to know before the work started, even those more difficult projects might have been more successful. There is always something hiding under a rock that you couldn’t see or didn’t know existed – and neither did the customer, until you dug it up.

Those hidden rocks – the unknowns and the unknown unknowns, along with all of the data problems, business processes, systemic issues, and so on – would have existed in any other situation. It’s the data issues that are the most difficult to resolve. Essential to the success of any solution approach is adding meaning (read: value) to data. Critical data must be documented, not just for its definition, structure, provenance, etc., but for its use; that is, how data is consumed by key decision-makers. Oftentimes, simply identifying what’s required – formulas, the correct source, aggregation, filtering – is a painstaking process, and there is no baby Yoda with the “Force” that can replace the necessity of knowing the business.
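As a sketch of what “documenting for use” might look like in practice, here is a minimal, hypothetical data-dictionary entry. Every name, formula, and value below is illustrative – not taken from any real system – but it shows the shape of the information a persistent analyst has to hunt down: definition, source, formula, and consumers.

```python
# A minimal data-dictionary entry: capturing not just the definition and
# provenance of a critical field, but how it is actually consumed.
# All names and values here are illustrative placeholders.
current_balance_entry = {
    "name": "current_balance",
    "definition": "Ledger balance net of pending holds, as of end of day",
    "source": "core_banking.ledger_daily",          # hypothetical source table
    "formula": "SUM(posted_amount) - SUM(pending_holds)",
    "consumers": ["daily liquidity report", "customer statements"],
    "owner": "Finance data steward",
}

# A quick completeness check an analyst might run over many such entries.
required_keys = {"name", "definition", "source", "formula", "consumers", "owner"}
assert required_keys.issubset(current_balance_entry)
```

Even a lightweight structure like this, kept centrally, spares the next analyst from re-interviewing the same SMEs to rediscover the same rules.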

I’m not writing this article for the initiated, many of whom know this all too well. I’m trying to help those who are struggling to deliver that all-important data lake. The point is, the greatest technical wizardry money can buy – cloud or otherwise – even the best-intentioned, well-architected, supreme solution, will not do the work of gathering requirements, including defining the data. This is not a technical task; it’s a hunting expedition, and without a Mandalorian on commission, you are only successful if you understand the business. To understand the business, you must speak to many people, ask the right questions, get complete answers, and confirm them. I’ve recently seen articles discussing how young Padawan data scientists are not enjoying their roles because much of their time is spent simply getting data and requirements, not running models; this should not come as a surprise.

Simple things can trip up even the best analysts, architects, and developers. For example, a seemingly simple concept like “current balance” at a financial services company can easily become a multi-headed monster as different business units, application teams, and reporting areas define the term for their own use. Reconciling the varying business rules, data requirements, and consumption architecture is a challenge for even the most experienced architects (read: Jedi). Legacy applications present their own unique challenges. I’ve had systems that stored the date in four integer fields (columns) – one each for the century, year, month, and day – with names like “excy” and “efdy.” I’m sure you will figure out what those mean just by looking at them. Success requires a persistent analyst to locate the SME, confirm all of the rules and the corresponding results, and document them (data mapping). Of course, in today’s ever-changing business world, understanding the history of how the data came to be in its current incarnation is also critical. “Incorrect” data from systems that changed over time required a series of filters to remove bad historical records, uncovered through many iterations. As we have all seen, it’s a rare team that maintains good documentation with enough detail and history – assuming it was ever captured – let alone for new systems and sources. Often it is only the availability of long-standing team members that keeps tribal knowledge alive. Hopefully, they aren’t all located on the planet Alderaan.
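To make the split-date example concrete, here is a minimal sketch of reassembling such a date once the analyst has confirmed what each column means. The column names and sample values are hypothetical – the whole point of fields like “excy” and “efdy” is that their meaning had to be dug out of the business first.

```python
from datetime import date

def decode_legacy_date(century: int, year: int, month: int, day: int) -> date:
    """Reassemble a date stored as four separate integer columns
    (century, two-digit year, month, day) into a proper date value."""
    return date(century * 100 + year, month, day)

# Hypothetical legacy row: century=19, year=87, month=5, day=23 -> 1987-05-23
print(decode_legacy_date(19, 87, 5, 23))  # 1987-05-23
```

The one-line formula is trivial; the expensive part is the data mapping that tells you which cryptic column holds the century and which holds the year.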

Previous technologies were not perfect, and much of what we do with data today is infinitely easier. ETL and earlier relational EDWs had their challenges. ETL is too manual, time-consuming, and hard to change or version. On-premises EDWs and their schemas are difficult to design and change too, requiring expensive dedicated hardware and support staff. The newer tools help in those areas: reducing the amount of dedicated hardware and the processing times, lessening the impact of changes, and allowing more efficient validation and testing activities. Machine learning and open source software are significantly changing what you can do with data and democratizing what was previously available only to specialists. Yet technology is still woefully incapable of creating meaning. Producing flatter tables and curated data sets with all the necessary detail lends great assistance to the consumers of that content, encouraging more self-service and reducing effort overall. Whether seasoned data scientists or newbie report developers, all will benefit from data with the understanding and meaning built in. Without a centralized process to do that beforehand, you will simply cause many different people to demand the same time from the same SMEs, asking the same questions.
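As an illustration of the “flatter, curated tables” idea, here is a minimal sketch in pandas – every table, column, and value is invented for the example – that pre-joins reference data and bakes one agreed definition of a measure into a single consumer-ready table, so downstream users never have to rediscover the rules themselves.

```python
import pandas as pd

# Illustrative source tables (all names and values are made up).
accounts = pd.DataFrame({
    "account_id": [1, 2],
    "segment": ["retail", "commercial"],
})
transactions = pd.DataFrame({
    "account_id": [1, 1, 2],
    "amount": [100.0, -40.0, 250.0],
})

# Curate one flat table: pre-aggregate the agreed-upon "current balance"
# definition and join the reference attributes in, so every consumer
# starts from the same, already-defined numbers.
curated = (
    transactions.groupby("account_id", as_index=False)["amount"].sum()
    .rename(columns={"amount": "current_balance"})
    .merge(accounts, on="account_id", how="left")
)
print(curated)
```

The curation step encodes the business rule once, centrally, instead of leaving each analyst to re-derive “current balance” from raw transactions.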

Call it data governance, data quality, data cleansing, metadata management, data architecture, data catalogs, or data dictionaries – for me, the problem remains the same exercise, just with a different name. I call it “understanding your data” by adding meaning to it. It will be a while before that skillset goes away or is automated by technology. Data is still very much a people business, no matter how fast or fancy the technology – and lacking Jedi mind tricks… May the Force be with you.