Data quality is a critical aspect of any data warehouse project. With an ever-increasing number of data sources and technologies, scoping and prioritization are daunting. In this document, we share our data quality framework for planning data quality actions, assessing the current state, and prioritizing data quality initiatives.
A practical framework for data quality is often overlooked as businesses focus on immediate value: increased revenues and decreased costs. As data becomes more integrated with strategy and operations, the bottom-line impact of data quality grows. Relatively new corporate functions like sales operations and revenue operations use data heavily and with increasing depth. Professionals in these functions are intentional about how they use data to generate value, and the data has to be accurate, timely, complete, and reliable. They rely on it to make business decisions, design processes around it, and automate systems that consume it. Data quality has become a priority.

Despite this growing awareness, challenges still get in the way of data quality initiatives:

- Resources are limited.
- It is difficult to demonstrate the impact of data quality actions.
- Data sources, flows, and uses keep multiplying, making it harder to prioritize data quality efforts.

Before we dive deeper, a reminder about scope: we focus on the cloud data warehouse, where data transformations are managed as a project, as is most commonly the case with dbt.
Data warehouse projects have three layers with clear differences in requirements: ingestion, transformation, and delivery.
This three-layered approach helps us categorize the data quality actions and prioritize them. Let's look at them one by one.
Ingestion is the point of data entry into the data warehouse. Different sources make the data available in various ways: transactional databases, transferred files, API calls, in-cloud data shares, and so on.
The data that arrives in the data warehouse environment is source data. We have no control over its creation; it may be external or internal, but it originates outside the data warehouse. Therefore, regardless of the delivery method or where the data comes from, there are standard steps to ensure quality at ingestion.
In this layer, we inspect the source data to ensure it is fresh, complete, and as expected. We can check expectations around shape, uniqueness, and size. We can also check content, such as permitted values and the distributions of measures (e.g., means and medians).
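In a dbt project, these ingestion-time expectations map naturally onto source definitions with freshness and column tests. Below is a minimal sketch; the schema, table, and column names (`raw_app`, `orders`, `_loaded_at`, the `status` values) are hypothetical illustrations, not part of the framework itself.

```yaml
# models/staging/app/_app__sources.yml (hypothetical path and names)
version: 2

sources:
  - name: app
    schema: raw_app
    loaded_at_field: _loaded_at        # column used for freshness checks
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        columns:
          - name: id
            tests:
              - unique                 # shape/uniqueness expectation
              - not_null
          - name: status
            tests:
              - accepted_values:       # content expectation: permitted values
                  values: ['placed', 'shipped', 'returned']
```

Running `dbt source freshness` then reports a warn or error per table, and `dbt test` enforces the column expectations.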
Treating the source data as “immutable” and keeping it intact is critical. We treat the source data almost like the items in an evidence room. We need to be able to say: "This is how we received the data." We don’t transform the data at this stage; we keep it intact. This approach has multiple advantages. For example, the original data will be readily available in data transformation updates or bug fixes. Another example is uncovering issues with the source data. With the source data intact, it is easier to trace the issue to the source, communicate with the data owners, and resolve the source data issue.
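One common way to keep a point-in-time record of mutable source data in dbt is a snapshot. A minimal sketch, assuming the hypothetical `app.orders` source from above with an `updated_at` column:

```sql
-- snapshots/orders_snapshot.sql (hypothetical names)
{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- Capture the source as received; dbt adds validity columns
-- (dbt_valid_from / dbt_valid_to) to track changes over time.
select * from {{ source('app', 'orders') }}

{% endsnapshot %}
```

Each `dbt snapshot` run records changed rows, preserving the "evidence room" view of what was received and when.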
The data quality activities in the ingestion step ensure the data quality going into the Data Warehouse.
After receiving the source data and checking it against our expectations, we build the data warehouse. Every view and table is rebuilt from the source data by our transformation code. Our data warehouse project (code) performs this step; with the 205 Data Lab approach (and most commonly in the modern data stack), it is performed primarily by dbt.
The transformation step is quite different from the ingestion step. Unlike the source data, where we have no control over what was received, we control the transformation here. We run the transformation as code.
Since the transformation is code-based, we expect it to run the same way every time. Once we have adequately tested that the transformation code works as expected, we deploy it and do not expect errors of the sort we tested for. Occasionally (rarely, if we are disciplined) we will encounter scenarios we have not yet accounted for. In those cases we change the code, test, and deploy, and we do not expect the same error again. The individual models (tables and views) will be accurate as long as the code is tested and deployed.
Therefore, the approach in this step is to have controlled deployment processes and practices and catch any bugs and errors early on, before deployment.
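In a dbt project, this "catch bugs before deployment" guardrail is often a "Slim CI" job that builds and tests only the models changed in a pull request. A hedged sketch of such a PR check, assuming production manifest artifacts have been downloaded to `./prod-artifacts` (the path is illustrative):

```shell
# Build and test only models modified relative to production,
# plus their downstream dependents; defer unchanged upstream
# models to their production versions.
dbt build --select state:modified+ --defer --state ./prod-artifacts
```

This keeps PR checks fast while still exercising everything the change could break.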
At the beginning of the transformation are the source models we discussed above. At the end of the transformation are the models consumed by other platforms, usually called "mart" models.
We will expose these models to reporting and visualization tools such as Looker (or Tableau, Mode, Sigma, or another). Alternatively, they could be used by machine learning models, operational analytics tools, or other applications that consume the transformed data.
For these models, similar to the source models, there are quality expectations at the destination. The requirements on the data warehouse are similar: operational and semantic checks on the data, and ensuring that what is promised is what is actually delivered.
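dbt exposures are one way to make the "what is promised" side of delivery explicit: they declare which downstream dashboards or tools depend on which mart models, so tests and lineage cover the hand-off. A sketch with hypothetical names (the dashboard, owner, and `fct_orders` model are illustrations):

```yaml
# models/marts/_exposures.yml (hypothetical path and names)
version: 2

exposures:
  - name: revenue_dashboard          # hypothetical Looker dashboard
    type: dashboard
    maturity: high
    owner:
      name: Analytics Team
      email: analytics@example.com
    depends_on:
      - ref('fct_orders')            # hypothetical mart model
```

With this in place, `dbt build --select +exposure:revenue_dashboard` builds and tests everything the dashboard depends on.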
We have outlined the three stages of the data warehouse, each with different data quality requirements: ingestion, transformation, and delivery. Each stage requires a different approach and a different set of tools. Below is our recommended checklist of data quality activities, aligned with these three stages.
| Ingestion | Transformation | Delivery |
| --- | --- | --- |
| Staging models for all entities: stage models with a `unique_key` tested for "not null" and "unique" | dbt best practices for code style (facilitated with a linter tool) | Standard delivery tests: correctness of shape |
| Standard ingestion tests | Standard transformation tests | Anomaly checks: volume of data, distribution of KPIs (e.g., mean, median, standard deviation) |
| Snapshotted source data | Quality control for new and existing models | Blue/green deployment |
| Anomaly checks: volume of data, distribution of KPIs (e.g., mean, median, standard deviation) | CI/CD guardrails: version control, PR template, PR checks before deployment ("Slim CI") | Version-controlled BI tools' best practices |
| Permission management for differentiated access to raw, transformed, and mart data | Monitoring the transformation project | Alerts integrated with email and Slack |
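The blue/green deployment item in the checklist can be as small as one statement on platforms that support atomic schema swaps. A sketch assuming Snowflake, with illustrative schema names: build and test everything in a staging schema, then swap it with the schema consumers read from.

```sql
-- After the full build and all tests pass in analytics_staging,
-- atomically exchange it with the live schema (Snowflake syntax).
alter schema analytics swap with analytics_staging;
```

Consumers never see a partially built or untested state; if anything fails, the live schema is simply left untouched.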
We also include some of the visualization elements in scope due to the significant interdependencies.
A growing number of tools are available to improve data quality. In this post, we capture our most up-to-date assessment based on our experience. In follow-up posts, we will continue expanding on these three stages with examples of different tools, best practices, and the practical checklists we use.
205 Data Lab is a data analytics and engineering firm that believes in the transformative power of data and technology and its ability to impact business outcomes positively.