By Cary Moore, Senior Director of Data Science


No cloud-driven data lake yet? Don’t worry… Avoid the stress and follow this guideline.

In 2019, it still seems a bit odd to be discussing the merits of a data lake. However, many organizations have been slow to adopt the cloud, and thus data lake implementation remains elusive – a distant strategy – with benefits yet to be realized. Many of my customers still have decision paralysis as they sort through the hype, reality, pros, cons, vendors, technology options, architecture, security, privacy, business cases, increasingly precious funding, and dizzying array of other endless concerns. The common theme of anxiety seems to be “deploy ‘big bang’ or build incrementally over time?”

Like any good consultant, I'll answer: “it depends” – of course – on the use cases. That is, what is the purpose of the data lake? What business needs will it address?

In my experience, while there are a couple of foundational steps, there is a huge benefit to deploying a cloud-based data lake: it's hard to make a mistake you can't recover from quickly. So why the anxiety? Several basic decisions guide your strategy.

1. At the risk of stating the obvious, you need to pick a cloud provider.
Choosing a provider can cause some hesitancy, since the fear of picking the wrong one is often the biggest concern. Fear not! With a minimal investment, you can start with any one of the major cloud providers. The initial investment shouldn't be lost, because switching costs, such as moving the data and porting over procedural logic and machine learning algorithms, are relatively low. Account setup and configuration, and acquiring storage and processing capacity, are similar across all of the major vendors. And the experience is useful whichever provider you end up choosing for your data lake. So even if your organization has no established cloud strategy or has yet to select a cloud vendor, that should not prevent you from taking your first steps.

2. Security is necessarily a major concern.
No one wants their company to be the next data breach headline on a major news channel. For any company, exposing sensitive data could be both a financial and a reputational catastrophe. So be certain to follow preferred practices. When setting up your security services as part of your cloud strategy, use your data management, InfoSec, internal audit, and compliance policies to assess the cloud vendor for potential exposures.

3. When establishing storage services, review roles, responsibilities, and access controls.
Regular, frequent reviews of access controls help catch access that was granted unintentionally or without authorization. These types of data governance processes reduce the risk of unaudited use of key company information.
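As a rough illustration of a recurring access review, the sketch below flags grants that have not been reviewed within a set window. The `AccessGrant` record, the 90-day window, and the sample users and resources are all hypothetical; in practice the grants would come from your cloud provider's access management service.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class AccessGrant:
    """Hypothetical record of one user's access to one data lake resource."""
    user: str
    resource: str
    role: str
    last_reviewed: date


def grants_due_for_review(grants, today, max_age_days=90):
    """Return the grants whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [g for g in grants if g.last_reviewed < cutoff]


# Illustrative data only; the names and dates are made up.
grants = [
    AccessGrant("alice", "raw/sales", "reader", date(2019, 1, 10)),
    AccessGrant("bob", "raw/hr", "admin", date(2019, 5, 20)),
]

overdue = grants_due_for_review(grants, today=date(2019, 6, 1))
for g in overdue:
    print(f"Review needed: {g.user} has {g.role} on {g.resource}")
```

Running a check like this on a schedule, and routing the results to the data owner, is one simple way to make the review a habit rather than a one-time setup task.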

4. The next big hurdle is deciding how to build the data lake.
“Big bang” or incrementally? The latter is the obvious answer. But how? Time-to-value is everything, and the value should be driven by the prioritized business use cases that the data lake is intended to address. Today, agile is the preferred delivery method, ensuring each use case is realized in small, well-defined increments that build upon the foundation one step at a time.
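One simple way to turn "prioritized business use cases" into an ordered backlog is to rank each use case by value relative to effort. The sketch below assumes a hypothetical 1-to-5 scoring scale and made-up use cases; the ratio is just one possible heuristic, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """Hypothetical backlog entry scored by the business and delivery teams."""
    name: str
    business_value: int  # 1 (low) to 5 (high), assumed scale
    effort: int          # 1 (small) to 5 (large), assumed scale


def prioritize(use_cases):
    """Order use cases by value-to-effort ratio, highest first."""
    return sorted(
        use_cases,
        key=lambda u: u.business_value / u.effort,
        reverse=True,
    )


# Illustrative backlog only; the scores are invented for the example.
backlog = [
    UseCase("customer churn report", business_value=5, effort=2),
    UseCase("enterprise-wide data catalog", business_value=4, effort=5),
    UseCase("daily sales dashboard", business_value=3, effort=1),
]

ordered = prioritize(backlog)
for rank, use_case in enumerate(ordered, start=1):
    print(rank, use_case.name)
```

Each increment then sources only the data its use case needs, so the lake grows alongside delivered value rather than ahead of it.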

5. Leverage the plethora of machine learning (ML) capabilities that every major cloud vendor provides.
These capabilities, many built on rich open source libraries, are necessary to find “nuggets of gold” in the oceans of data. Applying the right automation to your ML strategy can dramatically increase delivery velocity and the throughput of results.

So, to break it down:

  1. Pick a cloud vendor; remember, any one will do, but it's best to know your long-term strategy.
  2. Know and apply InfoSec privacy and security protocols.
  3. Define the access grants for authorized users.
  4. Define the business use cases which drive the prioritization of the incremental data lake sourcing strategy to deliver value early and often.
  5. Apply automated ML to discern hidden patterns and trends which can drive your company's strategic business plans.

The biggest benefit of having a data lake is the opportunity to address data-driven business requirements as quickly and at as little cost as possible. As the capability matures, more governance and structure can be formalized and implemented. Don't delay: start your data lake today.