Recent studies show that over 80% of Data Warehouse projects ultimately fail. This can happen for multiple reasons, such as not focusing on delivering business value or treating them as purely technology projects. To put in perspective how challenging these projects can be, it has been said that only desktop-deployed software projects have a lower success rate than Data Warehousing. This is primarily because every part of acquisition and preparation (modeling and transformations) can present its own unique set of problems.
When starting a Data Warehousing project, many BI solution providers offer products that remove Data Warehousing as a crucial step, and as a result they ignore or delay the challenges of data acquisition and modeling. This often leaves customers with a solution that does not scale in terms of data volumes, refresh cadence, or expansion into additional insights. Customers should instead consider a multi-faceted approach built on a foundation of quality data pipelines that routinely refresh data, plus a curated layer that models and prepares the data to answer many questions. That curated layer can then serve as the input for the BI solutions your organization will build, rather than a pointed data set that can only answer one question.
With a growing number of data sources and a significant increase in the variety, types, and volume of data to be ingested, the complexity of data acquisition has increased, and the growth in sources has created unique challenges in modeling. In the past, organizations stored customer data in a handful of systems and could manage their business, to some degree, by reporting directly off of those source systems. The number of unique systems has grown exponentially in recent years, however, driven by easy access to SaaS platforms that capture customer data and interactions more accurately and in larger volumes. This increased complexity on both fronts has led to a steady decline in overall data quality across customer platforms, and these challenges, paired with a rapidly changing tool environment, have made a complex problem even more daunting.
The move from ETL to ELT, and the switch from a traditional Data Warehouse to Gartner’s concept of the Logical Data Warehouse, combines traditional Data Warehouses with a Data Lake. While this makes the overall deployment more complex, it simplifies the implementation of pipelines and curation logic, and subsequently increases the success rate of these projects. ELT allows a clean split between extracting and preparing data: whether in a Data Lake, Lakehouse, or traditional Data Warehouse, we can now land data, in all its forms, in one place, and then run transformations on a platform designed to work with large volumes of data quickly and efficiently. After landing, we can model and transform the data through multiple layers, allowing us to go to market faster, with more flexibility and less re-engineering when new sources are added.
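The land-then-transform flow described above can be sketched in a few lines. This is a minimal, hypothetical illustration using SQLite as a stand-in for the warehouse engine; the table names (`landing_orders`, `curated_orders`) and the typing rule are assumptions for the example, not a prescribed standard.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Landing layer: store records exactly as received, plus load metadata.
con.execute("""
    CREATE TABLE landing_orders (
        raw_id    INTEGER PRIMARY KEY,
        source    TEXT,
        order_id  TEXT,
        amount    TEXT,          -- kept as text: land first, type later
        loaded_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
con.executemany(
    "INSERT INTO landing_orders (source, order_id, amount) VALUES (?, ?, ?)",
    [("crm", "A-1", "19.99"), ("erp", "A-2", "5.00"), ("crm", "A-3", "bad")],
)

# Curated layer: typing, filtering, and business rules happen here,
# inside the engine -- the "T" of ELT runs after the load, not before.
con.execute("""
    CREATE TABLE curated_orders AS
    SELECT order_id,
           CAST(amount AS REAL) AS amount_usd,
           source
    FROM landing_orders
    WHERE amount GLOB '[0-9]*'   -- unparseable rows stay in landing for review
""")
rows = con.execute("SELECT COUNT(*) FROM curated_orders").fetchone()[0]
print(rows)  # 2 of the 3 landed rows pass curation
```

Because the raw record is preserved in the landing layer, curation rules can be changed and replayed later without re-extracting from the source.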
Traditional Data Warehouses embed business logic in their structures, especially in reference architectures such as a Kimball-style warehouse, which are not only complicated to load but can also require a great deal of re-engineering when adding new sources to conformed dimensions, for example. Modern warehousing approaches such as Data Vault isolate this logic into multiple tiers, allowing teams to move faster and more accurately through data preparation. Putting all of this together: while the overall approach yields higher success rates, the process can still be complex. With that in mind, there are some best practices to keep in mind as well as some anti-patterns to avoid:
DO: Address Challenges in Acquiring and Processing Data
- Start Small: Use a POC to ensure that the pattern selected works for both your data acquisition pipeline as well as your overall warehousing strategy
- CDC: Develop a CDC Strategy that allows ongoing frequent refreshes of the data in your warehouse
- Track History: Bill Inmon has updated his definition of the data warehouse to include “over time”; plan on keeping historical data in the warehouse in an easily manageable manner
- Abstract Logic: Just as you should not “big bang” an environment, you should not “big bang” the business logic you apply to make the data fit for use either
- Create a pattern: Templatize your integration logic so that the process is auditable and repeatable
- Make a Decision: For things that could go either way (naming conventions, date formats, use of time zones, etc.), make a decision and stick with it; whatever you choose, be consistent
- Plan for Promotions: Multiple environments and a promotion process need to be in place for production, not for the POC; that does not, however, mean you can ignore them during the POC — plan for them early
- Prepare to be a Partner: Unless you control your platform completely, make friends with the security and infrastructure teams to make standing up your POC easy and plan on taking the solution to production
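The “Track History,” “CDC,” and “Create a pattern” practices above can be combined in a single repeatable routine. The sketch below is a hypothetical, minimal Type 2-style history keeper in plain Python: each incoming change closes the current version of a record and opens a new one, so the warehouse retains data “over time.” The function name `apply_changes` and the row shape are illustrative assumptions, not a specific product’s API.

```python
from datetime import date

def apply_changes(history, changes, today):
    """Apply a CDC batch to a Type 2-style history table.

    history: list of dicts with keys "key", "attrs", "valid_from",
             "valid_to" (None marks the current version of a record).
    changes: list of dicts with keys "key" and "attrs".
    """
    current = {r["key"]: r for r in history if r["valid_to"] is None}
    for change in changes:
        key = change["key"]
        row = current.get(key)
        if row is not None and row["attrs"] == change["attrs"]:
            continue  # no real change: record nothing
        if row is not None:
            row["valid_to"] = today  # close the old version
        new_row = {"key": key, "attrs": change["attrs"],
                   "valid_from": today, "valid_to": None}
        history.append(new_row)
        current[key] = new_row
    return history

history = [{"key": "C1", "attrs": {"city": "Austin"},
            "valid_from": date(2024, 1, 1), "valid_to": None}]
apply_changes(history, [{"key": "C1", "attrs": {"city": "Dallas"}}],
              date(2024, 6, 1))
print(len(history))  # both versions of C1 are retained
```

Because the same routine handles any source that can emit a key plus attributes, it doubles as the templatized, auditable integration pattern the bullets call for.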
DO NOT: Trap Yourself with a Non-Scalable Solution that Does Not Grow with Your Business
- Take on Too Much: People will often push for a POC to include multiple sources or to replace a key report used by a wide audience; these initiatives are difficult to deliver on due to their complexity and large number of stakeholders
- Spend Time on Ancillary Processes: Don’t forgo a POC that delivers value to implement the perfect data catalog up front
- Fall into Common Traps: Keep an eye out for one-time loads, loading QA data with no path to production, and failing to plan for how promotions will occur
- Sit in “Analysis Paralysis”: There are a lot of tools, platforms, and methodologies out there; you need to move fast and deliver value before business partners lose interest
- Forgo a Backlog: Once you deliver your POC you will need a plan to keep delivering; shipping a first use case with no additional planned work is a good way to lose buy-in from the user base
Demystifying Data With AllCloud
For anyone starting on a data acquisition or modeling journey, consistency and efficiency are key. Picking a standard and sticking to it, moving quickly through tool selection, delivering a POC, and maintaining a planned backlog to keep stakeholders engaged are all crucial to the success of a new data program, a stagnant one, or one looking to modernize. The focus in the POC should be on avoiding future technical debt, both across the dev, QA, and production lifecycle and in the design of the jobs themselves. This ensures the team is prepared to work with multiple databases, filesets, and API sources, using a common pattern for data ingestion that works with the full array of data sources.
Ready to get started on your data warehouse journey? Talk to our experts today!