
How to structure data quality actions in your cloud data warehouse?

Data quality is a critical aspect of any data warehouse project. With an ever-increasing number of data sources and technologies, scoping and prioritizing data quality work is daunting. In this document, we share our framework for planning data quality actions, assessing the current state, and prioritizing data quality initiatives.

Applied AI

Data Analytics

Data Engineering

A practical framework for data quality is often overlooked as businesses focus on immediate value: increased revenues and decreased costs. But with data becoming increasingly integrated with strategy and operations, the bottom-line impact of data quality is growing. Relatively new corporate functions like sales operations and revenue operations use data heavily and with increasing depth. The professionals in these functions are intentional about how they use data to generate value, and the data has to be accurate, timely, complete, and reliable. They rely on the data to make business decisions, and they design processes and automated systems that depend on it. Data quality has become a priority.

Despite the increasing awareness of data quality, challenges still get in the way of data quality initiatives:

  • Resources are limited.
  • It is challenging to show the impact of data quality actions.
  • Data sources, flows, and uses keep multiplying, making it harder to prioritize data quality efforts.

This document shares our data quality framework for planning data quality actions, assessing the current state, and prioritizing data quality initiatives. Before we dive deeper, note that our scope is the cloud data warehouse, where the data transformations are managed as a project, as is most commonly the case with dbt.

How can we ensure data quality in a structured manner?

Data Warehouse projects have three layers with clear differences in requirements:

  • Data ingestion
  • Data transformation
  • Data delivery

This three-layered approach helps us categorize the data quality actions and prioritize them. Let's look at them one by one.


A framework for data quality

Step 1: Ingestion

Ingestion is the point of data entry to the data warehouse. Different sources make the data available in various ways, such as transactional databases, transferred files, API calls, in-cloud data shares, etc.

The data that arrives in the data warehouse environment is the source data. We have no control over how the source data is created; it may come from inside or outside the organization, but it always originates outside the data warehouse. Therefore, regardless of the delivery method or where the data comes from, there are standard steps to ensure quality at ingestion.

In this layer, we inspect the source data to ensure it is fresh, complete, and as expected. We can check expectations around shape, uniqueness, and size. We can also check content, such as permitted values, averages, and distributions of measures.
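As a rough sketch of what these checks can look like in a dbt project, the source definition below declares freshness expectations and a few column-level tests. The names (a hypothetical `crm` source with an `opportunity` table and a `_loaded_at` column) are illustrative assumptions, not a prescription:

```yaml
# models/staging/crm/_crm__sources.yml -- illustrative names; adjust to your own sources
version: 2

sources:
  - name: crm                      # hypothetical source system
    database: raw
    schema: crm
    loaded_at_field: _loaded_at    # assumes the loader stamps each row with a load timestamp
    freshness:                     # recency check: `dbt source freshness` warns/errors when data is stale
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: opportunity
        columns:
          - name: id
            tests:                 # uniqueness and completeness of the business key
              - unique
              - not_null
          - name: stage_name
            tests:
              - accepted_values:   # permitted-values check
                  values: ['prospecting', 'closed_won', 'closed_lost']
```

Running `dbt source freshness` and `dbt test --select source:crm` on the ingestion schedule then covers the recency, shape, and content expectations described above.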

Treating the source data as “immutable” and keeping it intact is critical. We treat the source data almost like items in an evidence room: we need to be able to say, "This is how we received the data." We don’t transform the data at this stage; we keep it intact. This approach has multiple advantages. For example, the original data is readily available when we update transformations or fix bugs. It also helps when issues with the source data surface: with the source data intact, it is easier to trace an issue back to its source, communicate with the data owners, and resolve it.

The data quality activities in the ingestion step ensure the data quality going into the Data Warehouse.

Step 2: Transformation

After receiving the source data and checking it against our expectations, we build the data warehouse. Every view and table in the data warehouse is built afresh from the source data by our transformation code, which is the data warehouse project itself. With the 205 Data Lab approach (and most commonly in the modern data stack), this step is performed primarily by dbt.

The transformation step is quite different from the ingestion step. Unlike the source data, where we have no control over what was received, we control the transformation here. We run the transformation as code.

Since the transformation is code-based, we expect it to run the same way every single time. Once we have adequately tested that the code performing the transformation works as expected, we deploy it and do not expect errors of the sort we tested for. Occasionally (rarely, if we are good) we will encounter scenarios we have yet to account for. In those cases we change the code, test, and deploy, and we do not expect the same error again. The individual models (tables and views) will be accurate if the code is “tested and deployed.”
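As an illustration of what "tested" can mean here, a model's schema file in dbt asserts keys and column values. The model and column names below (`fct_orders`, `dim_customers`) are hypothetical:

```yaml
# models/marts/_fct_orders.yml -- hypothetical model and column names
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique                  # one row per order
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:          # referential integrity against the dimension
              to: ref('dim_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:        # category-variable assertion
              values: ['placed', 'shipped', 'completed', 'returned']
```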

Therefore, the approach in this step is to have controlled deployment processes and practices and catch any bugs and errors early on, before deployment.
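One way to put such guardrails in place is to run only the modified models and their tests on every pull request (“Slim CI”). The minimal GitHub Actions sketch below shows the idea; the adapter, helper script, and paths are assumptions rather than a fixed recipe:

```yaml
# .github/workflows/dbt_ci.yml -- minimal sketch; adapter, scripts, and paths are assumptions
name: dbt CI

on:
  pull_request:

jobs:
  slim-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        run: pip install dbt-snowflake   # assumes a Snowflake warehouse; use your adapter
      - name: Fetch production artifacts
        run: ./scripts/fetch_prod_manifest.sh   # hypothetical helper that places manifest.json in ./prod-artifacts
      - name: Build and test only what changed
        run: dbt build --select state:modified+ --defer --state ./prod-artifacts
```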

Step 3: Delivery

At the beginning of the transformation are the source models we discussed above. At the end of the transformation are the models that will be consumed by other platforms, usually called "mart" models.

We will expose these models to reporting and visualization tools such as Looker (or Tableau, Mode, Sigma, or another). Alternatively, they could be used by machine learning models, operational analytics tools, or other applications that consume the transformed data.

For these models, just as for the source models, there are quality expectations at the destination. The requirements on the data warehouse are similar: operational and semantic checks on the data, and ensuring that what is promised is what is actually delivered.
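In dbt, one way to make these delivery contracts explicit is an exposure, which declares which mart models a downstream dashboard or application depends on. The dashboard name, models, URL, and owner below are placeholders:

```yaml
# models/marts/_exposures.yml -- placeholder names and URL
version: 2

exposures:
  - name: revenue_overview_dashboard
    type: dashboard
    maturity: high
    url: https://looker.example.com/dashboards/42   # placeholder
    description: Revenue KPIs consumed by sales and revenue operations.
    depends_on:
      - ref('fct_revenue')
      - ref('dim_customers')
    owner:
      name: Analytics team
      email: analytics@example.com
```

With the exposure declared, `dbt build --select +exposure:revenue_overview_dashboard` builds and tests everything the dashboard depends on before it is refreshed.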

We have outlined the three stages of the data warehouse, each with its own data quality requirements: ingestion, transformation, and delivery. Each stage requires a different approach and a different set of tools.

Our recommended checklist:

Below is our recommended data quality checklist, aligned with the three steps of a data warehouse project.

Ingestion:
  • Staging models for all entities: staging models with a unique_key tested for “not null” and “unique”
  • Standard ingestion tests: no nulls for required columns; correctness of data types; recency and freshness tests; correctness of shape (the expected columns are there)
  • Snapshotted source data
  • Anomaly checks: volume of data; distribution of KPIs (e.g., mean, standard deviation)

Transformation:
  • dbt style best practices: code style, facilitated with a linter tool
  • Standard transformation tests: a unique key for every model; not null and unique tests for every unique key; column value assertions (relationships, accepted values for booleans and categorical variables, and not null tests where required)
  • Quality control for new and existing models
  • CI/CD guardrails: version control, a PR template, and PR checks before deployment (“Slim CI”)
  • Monitoring the transformation project
  • Alerts integrated with email and Slack

Delivery:
  • Standard delivery tests: correctness of shape
  • Anomaly checks: volume of data; distribution of KPIs (e.g., mean, standard deviation); see the sketch after this checklist
  • Blue/green deployment
  • Version-controlled best practices for BI tools

Across all three layers:
  • Permission management for differentiated access to raw, transformed, and mart data
  • Documentation
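For the anomaly checks in the checklist above, dbt's built-in tests can be extended with a package such as dbt-expectations. The sketch below, assuming that package is installed and using hypothetical model, column, and threshold values, checks the row volume of a mart model and the distribution of one KPI:

```yaml
# models/marts/_fct_revenue_anomaly_checks.yml
# Hypothetical names and thresholds; assumes the dbt-expectations package is installed.
version: 2

models:
  - name: fct_revenue
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:   # volume check
          min_value: 1000
    columns:
      - name: daily_revenue
        tests:
          - dbt_expectations.expect_column_mean_to_be_between:   # distribution check on a KPI
              min_value: 100
              max_value: 10000
```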

We also include some of the visualization elements in the scope because of the significant interdependencies.

To be continued:

There is an increasing number of tools available to improve data quality. In this blog, we intend to capture our most up-to-date assessment based on our experience. In additional posts to follow, we’ll continue expanding on these three stages with examples of different tools, best practices, and practical checklists we use.

Stay tuned!

