How to measure data timeliness, freshness and staleness metrics (2024)

Last updated: August 21, 2024

Read this guide to learn how to measure data timeliness metrics, such as freshness (the age of the most recent data) or staleness (the time since the data was last loaded).

The timeliness data quality checks are configured in the timeliness category in DQOps.

Data timeliness

Timeliness is a measure of the lag between when something happened and when it was recorded. When we apply timeliness to data, we arrive at the concept of data timeliness. Data timeliness uses the most recent timestamps in the dataset to calculate the time lag.

The formula that calculates data timeliness compares two timestamps and measures the time difference between them: the data lag.

  • The first value is the most recent timestamp in the dataset. It is the point in time when something happened. It can be a business action, such as the timestamp of the most recent transaction in an eCommerce platform, the most recent impression of an advertisement, the timestamp of a log entry, or the timestamp of the last shipment.

  • The second value is the system's current time. It is the timestamp when we recorded the state of the data in the dataset.

The only challenge here is to identify the right timestamp column in the dataset that can tell us about the currency of the data.

Data freshness

The most relevant timeliness metric is data freshness. It is the delay since the last business action recorded in the database.

The most common business actions are:

  • The transaction timestamp in eCommerce.

  • The timestamp of the most recent user registration in a user database.

  • The impression timestamp of digital ads.

  • The invoice issue timestamp.

  • The timestamp of the log entry if we are aggregating logs.

Data freshness measures the time lag between one of these business actions and the timestamp when a data observability platform such as DQOps analyzed the table and found the most recent business action.

The following table shows a list of potential orders in an eCommerce platform.

| order_id | customer_id | created_at          |
|----------|-------------|---------------------|
| 100      | 5           | 2023-11-12 09:15:00 |
| 101      | 7           | 2023-11-13 08:40:00 |
| 102      | 4           | 2023-11-14 10:25:00 | <- the most recent transaction

Let's assume the current system time when the data freshness calculation was performed was 2023-11-16 11:45:00 (November 16th, 2023, 11:45 AM). We want to calculate the lag (time difference) between the most recent business action and the timestamp when we tested data freshness.

| order_id | customer_id | created_at          | Data freshness formula: NOW() - created_at      | Data lag      |
|----------|-------------|---------------------|-------------------------------------------------|---------------|
| 100      | 5           | 2023-11-12 09:15:00 | "2023-11-16 11:45:00" - "2023-11-12 09:15:00"   | 4 days 2h 30m |
| 101      | 7           | 2023-11-13 08:40:00 | "2023-11-16 11:45:00" - "2023-11-13 08:40:00"   | 3 days 3h 5m  |
| 102      | 4           | 2023-11-14 10:25:00 | "2023-11-16 11:45:00" - "2023-11-14 10:25:00"   | 2 days 1h 20m |

We don't need to calculate the data lag for every record in the table. We only need to know the timestamp of the most recent business transaction. DQOps finds the most recent timestamp using a MAX function, which is an aggregate function in SQL.

| order_id | customer_id | created_at          | Data freshness formula: NOW() - MAX(created_at) | Data lag      |
|----------|-------------|---------------------|-------------------------------------------------|---------------|
| 102      | 4           | 2023-11-14 10:25:00 | "2023-11-16 11:45:00" - "2023-11-14 10:25:00"   | 2 days 1h 20m |

The data freshness of this table is 2 days 1 hour 20 minutes or, expressed numerically, approximately 2.0556 days.
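The calculation above can be sketched in a few lines of Python. The timestamps and the "current" time are hard-coded from the example table; a real check would use the database clock.

```python
from datetime import datetime, timedelta

# created_at values from the sample orders table
created_at = [
    datetime(2023, 11, 12, 9, 15),
    datetime(2023, 11, 13, 8, 40),
    datetime(2023, 11, 14, 10, 25),
]

# The moment the freshness check ran (fixed here to match the example;
# a real check would use the current system time)
now = datetime(2023, 11, 16, 11, 45)

# Data freshness = NOW() - MAX(created_at)
freshness = now - max(created_at)
print(freshness)                                # 2 days, 1:20:00
print(round(freshness / timedelta(days=1), 4))  # 2.0556
```

Dividing a timedelta by `timedelta(days=1)` is a convenient way to get the lag as a fractional number of days.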

Data staleness

Because data timeliness measures the currency of a copy of data stored in a database, we must distinguish the timestamps of business actions from the time when the data pipeline (ETL process) loaded the data into the data warehouse or data lake.

A well-designed data pipeline augments the target table with additional columns that identify the data source and the time the data was inserted. If the data pipeline adds metadata columns such as inserted_at or loaded_at, it populates them with the timestamp when the data was stored in the database. These columns become our next reference point in time: they store the time when the data was most recently loaded into the database. We use the name data staleness to refer to the time lag since the last refresh of the data warehouse.

Let's extend the previous example of the orders table. The table will have an additional column named loaded_at, which holds the timestamp when the data was loaded into the database. If the data pipeline loads the data incrementally, the loaded_at timestamps differ for each row. But if the data pipeline performs a full table refresh every time, all rows receive the same loading timestamp, which changes on every refresh.

| order_id | customer_id | created_at          | loaded_at           |
|----------|-------------|---------------------|---------------------|
| 100      | 5           | 2023-11-12 09:15:00 | 2023-11-12 11:00:00 |
| 101      | 7           | 2023-11-13 08:40:00 | 2023-11-13 11:00:00 |
| 102      | 4           | 2023-11-14 10:25:00 | 2023-11-14 11:00:00 |

The formula for calculating the data staleness compares the time between the most recent loaded_at timestamp and the point in time of checking the data staleness.

| order_id | customer_id | created_at          | loaded_at           | Data staleness formula: NOW() - loaded_at     | Data lag      |
|----------|-------------|---------------------|---------------------|-----------------------------------------------|---------------|
| 100      | 5           | 2023-11-12 09:15:00 | 2023-11-12 11:00:00 | "2023-11-16 11:45:00" - "2023-11-12 11:00:00" | 4 days 0h 45m |
| 101      | 7           | 2023-11-13 08:40:00 | 2023-11-13 11:00:00 | "2023-11-16 11:45:00" - "2023-11-13 11:00:00" | 3 days 0h 45m |
| 102      | 4           | 2023-11-14 10:25:00 | 2023-11-14 11:00:00 | "2023-11-16 11:45:00" - "2023-11-14 11:00:00" | 2 days 0h 45m |

We also need only one record, the most recently loaded one, to calculate the data staleness. It is enough that a data observability platform such as DQOps finds the most recent value using a MAX aggregate function.

| order_id | customer_id | created_at          | loaded_at           | Data staleness formula: NOW() - MAX(loaded_at) | Data lag      |
|----------|-------------|---------------------|---------------------|------------------------------------------------|---------------|
| 102      | 4           | 2023-11-14 10:25:00 | 2023-11-14 11:00:00 | "2023-11-16 11:45:00" - "2023-11-14 11:00:00"  | 2 days 0h 45m |

The data staleness of this table is 2 days 0 hours 45 minutes or, expressed numerically, 2.03125 days.
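The staleness query can be sketched with an in-memory SQLite table holding the sample rows. The order_id and customer_id values are illustrative; MAX works here on ISO-formatted text timestamps because they sort lexicographically.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INT, customer_id INT, created_at TEXT, loaded_at TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (100, 5, "2023-11-12 09:15:00", "2023-11-12 11:00:00"),
    (101, 7, "2023-11-13 08:40:00", "2023-11-13 11:00:00"),
    (102, 4, "2023-11-14 10:25:00", "2023-11-14 11:00:00"),
])

# Find the most recent load with a MAX aggregate, as described above
(latest_load,) = conn.execute("SELECT MAX(loaded_at) FROM orders").fetchone()

now = datetime(2023, 11, 16, 11, 45)  # fixed to match the example
staleness = now - datetime.strptime(latest_load, "%Y-%m-%d %H:%M:%S")
print(staleness)                      # 2 days, 0:45:00
print(staleness / timedelta(days=1))  # 2.03125
```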

Ingestion delay

The time between when the business action happened and when the record was loaded is called the ingestion delay. It is the time the record spent being processed by the data pipeline, or the delay between consecutive runs of the pipeline.

DQOps finds the two most recent timestamps and compares them.

  • The most recent timestamp of a business transaction.

  • The most recent timestamp of loading the record.

| order_id | customer_id | created_at          | loaded_at           | Data ingestion delay formula: MAX(loaded_at) - MAX(created_at) | Data lag      |
|----------|-------------|---------------------|---------------------|----------------------------------------------------------------|---------------|
| 102      | 4           | 2023-11-14 10:25:00 | 2023-11-14 11:00:00 | "2023-11-14 11:00:00" - "2023-11-14 10:25:00"                  | 0 days 0h 35m |

The ingestion delay for the dataset shown above is 35 minutes (11:00:00 minus 10:25:00 on November 14th).
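Computing the delay from the two maxima in the sample table is a one-line subtraction:

```python
from datetime import datetime

# Most recent business action and most recent load from the orders table
max_created_at = datetime(2023, 11, 14, 10, 25)
max_loaded_at = datetime(2023, 11, 14, 11, 0)

# Ingestion delay = MAX(loaded_at) - MAX(created_at)
ingestion_delay = max_loaded_at - max_created_at
print(ingestion_delay)  # 0:35:00
```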

When applied to date-partitioned data, the ingestion delay becomes a more critical timeliness measure. Tables that are physically partitioned by date can benefit from more efficient incremental data loads that refresh whole partitions.

The partition checks in DQOps calculate the ingestion delay for each daily or monthly partition, allowing for tracking when partitions were loaded or reloaded.
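A rough sketch of a per-partition calculation, again using SQLite on the sample rows: the daily partition key is derived from created_at, and the query finds the longest row-level delay within each day. The exact SQL that DQOps generates differs per database; this only illustrates the grouping idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, created_at TEXT, loaded_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (100, "2023-11-12 09:15:00", "2023-11-12 11:00:00"),
    (101, "2023-11-13 08:40:00", "2023-11-13 11:00:00"),
    (102, "2023-11-14 10:25:00", "2023-11-14 11:00:00"),
])

# Longest delay (in minutes) between creating a row and loading it, per daily partition
rows = conn.execute("""
    SELECT DATE(created_at) AS partition_day,
           MAX((JULIANDAY(loaded_at) - JULIANDAY(created_at)) * 24 * 60) AS delay_minutes
    FROM orders
    GROUP BY DATE(created_at)
    ORDER BY partition_day
""").fetchall()

for day, minutes in rows:
    print(day, round(minutes))  # 105, 140, and 35 minutes respectively
```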

Freshness, staleness and ingestion delay compared

All timeliness metrics supported by DQOps are compared on the following time diagram.

Configure DQOps for timeliness checks

Before activating any timeliness data quality checks in DQOps, you must enter additional configuration about the monitored table. DQOps must know which columns in the table store relevant timestamp values. DQOps stores the configuration of these columns for each table in the table's YAML file. The section is named timestamp_columns.

You have to configure one or both of the column names:

  • The event_timestamp_column parameter stores the name of the column containing the most recent business action.

  • The ingestion_timestamp_column parameter stores the name of the column containing the timestamp when the row was loaded into the target table.

Configuring timestamp columns from UI

To configure the event and/or ingestion timestamp columns:

  1. Go to the Data Sources section.

  2. Select the table of interest from the tree view.

  3. Select the Data and Time Columns tab and select a column from the drop-down list in the "Event timestamp column name for timeliness checks" and/or "Ingestion timestamp column name for timeliness checks" input fields.

  4. Click the Save button in the upper right corner.

Configuring timestamp columns in YAML

The event and ingestion timestamps for timeliness checks can also be configured by adding the appropriate parameters to the table's YAML configuration file.

Below is an example YAML file showing a sample configuration with the event timestamp (business action) column set in event_timestamp_column and the ingestion timestamp (the loaded_at or similar) column set in ingestion_timestamp_column.

```yaml
apiVersion: dqo/v1
kind: table
spec:
  timestamp_columns:
    event_timestamp_column: created_at
    ingestion_timestamp_column: loaded_at
```

Enabling table freshness monitoring

The data_freshness check monitors the freshness of the table. The best way to track data freshness is by using the daily monitoring variant of this check, named daily_data_freshness.

The data_freshness check captures the delay since the timestamp of the most recent business action, which we call the event timestamp in DQOps.

The data quality rules for this check take one parameter, max_days, which configures the maximum accepted lag in days. DQOps supports setting this parameter at multiple severity levels. We advise configuring the check at the warning and error severity levels.

  • When the lag exceeds the maximum number of days for the warning threshold, DQOps will raise a data quality issue with a warning severity level. The current lag should be reviewed, and the configuration of the check should be adapted to avoid too many false-positive alerts.

  • When the lag exceeds the maximum number of days for the error threshold, DQOps will raise a data quality issue with an error severity level. These issues should be reviewed and reported to the data engineering team or the data owner.
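For example, a configuration with both severity levels might look like this. The 1- and 2-day thresholds are illustrative values, not recommendations; pick thresholds that match how often the table is actually refreshed.

```yaml
apiVersion: dqo/v1
kind: table
spec:
  timestamp_columns:
    event_timestamp_column: created_at
  monitoring_checks:
    daily:
      timeliness:
        daily_data_freshness:
          warning:
            max_days: 1.0
          error:
            max_days: 2.0
```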

Configuring table freshness in UI

The data timeliness checks are configured on the table level, as shown in the following screenshot. Data freshness is a standard check.If you want to display additional data timeliness checks, select the Show advanced checks checkbox.

Configuring table freshness in YAML

The configuration of data_freshness is straightforward in YAML.

In the following example, an error is raised when the table freshness lag exceeds 1.5 days.

```yaml
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  timestamp_columns:
    event_timestamp_column: created_date
  monitoring_checks:
    daily:
      timeliness:
        daily_data_freshness:
          error:
            max_days: 1.5
```

Tracking data timeliness on dashboards

DQOps has several built-in data quality dashboards dedicated to data timeliness. They allow you to track and review all types of timeliness issues from multiple angles.

  • The Current timeliness issues dashboard shows the most recent data timeliness issues. If the missing data is received and the current data lag is below the allowed threshold, DQOps no longer shows the older issues on the dashboard.

  • The History of timeliness issues dashboard also shows older timeliness issues, allowing a review of the reliability of the timeliness dimension for each table.

  • The Table freshness - tables with most current data dashboard shows the tables containing the most recent business actions.

  • The Table freshness - tables with the oldest data dashboard shows tables that contain outdated data. These tables do not receive new records.

  • The Table staleness - tables most recently loaded dashboard shows the most recently loaded tables, identified by the data staleness check.

  • The Table staleness - tables no longer loaded dashboard shows tables not refreshed for a long time, sorted from the oldest. These tables should probably be retired, or a data pipeline has stopped working and is not feeding new data.

  • The History of table freshness dashboard shows the data freshness lag for each day in a pivot table. This dashboard helps identify when tables are most likely to receive new data.

  • The Minimum, maximum, average delay dashboard shows statistics about the data processing lag.

Data freshness dashboard

The Table freshness - tables with most current data dashboard shows the tables with the most recent data, but also shows data freshness issues for outdated tables.

History of timeliness issues

The History of timeliness issues dashboard shows an aggregated list of recent data timeliness issues.

List of timeliness checks at a table level

  • data_freshness — Data freshness (Maximum age of the most recent row). Data quality dimension: Timeliness. A standard, table-level check that calculates the time difference between the most recent row in the table and the current time. The timestamp column used for the comparison is defined as timestamp_columns.event_timestamp_column in the table configuration. This check is also known as "Data Freshness".

  • data_freshness_anomaly — Data freshness anomaly (Abnormal delay in data delivery). Data quality dimension: Timeliness. This check calculates the time difference between the most recent row and the current time and detects anomalies in a time series of previous averages. The timestamp column used for the comparison is defined as timestamp_columns.event_timestamp_column in the table configuration. It raises a data quality issue when the mean is in the top anomaly_percent percentage of the most outstanding values in the time series. This data quality check uses a 90-day time window and requires a history of at least 30 days.

  • data_staleness — Data staleness (Maximum number of days since the most recent ingestion). Data quality dimension: Timeliness. A table-level check that calculates the time difference between the last timestamp when any data was loaded into a table and the current time. This check can only be used when a data pipeline, ETL process, or trigger in the data warehouse fills an extra column with the timestamp when the data loading job was executed. The ingestion timestamp column used for the comparison is defined as timestamp_columns.ingestion_timestamp_column in the table configuration. This check is also known as "Data Staleness".

  • data_ingestion_delay — Data ingestion delay (Maximum number of days between the last record being created and loaded). Data quality dimension: Timeliness. A table-level check that calculates the time difference between the most recent row in the table and the most recent timestamp when the last row was loaded into the data warehouse or data lake. To identify the most recent row, the check finds the maximum value of the timestamp column that should contain the last modification timestamp from the source. The timestamp when the row was loaded is identified by the most recent (maximum) value of a timestamp column filled by the data pipeline, for example: "loaded_at", "updated_at", etc. This check requires that the data pipeline fills an extra column with the timestamp when the data loading job was executed. The names of both columns used for the comparison should be specified in the "timestamp_columns" configuration entry on the table.

  • reload_lag — Reload lag (Maximum delay to load the last record for each partition). Data quality dimension: Timeliness. A table-level check that calculates the maximum difference in days between the ingestion timestamp and event timestamp values on any row. This check should be executed only as a partitioned check because it finds the longest delay between the time a row was created in the data source and the timestamp when the row was loaded into its daily or monthly partition. This check detects that a daily or monthly partition was reloaded, which also sets the most recent timestamps in the created_at, loaded_at, inserted_at, or other similar columns filled by the data pipeline or an ETL process during data loading.

Reference and samples

The full list of all data quality checks in this category is located in the table/timeliness reference. The reference section provides YAML code samples that are ready to copy-paste to the .dqotable.yaml files, the parameters reference, and samples of data source specific SQL queries generated by the data quality sensors that are used by those checks.

What's next

  • Learn how to run data quality checks filtering by a check category name
  • Learn how to configure data quality checks and apply alerting rules
  • Read the definition of data quality dimensions used by DQOps