Data Methodology
How Apiar Data selects, collects, normalises, and maintains the institutional data behind our signal coverage.
Source Selection
Data quality begins at the source. Our source selection criteria prioritise institutional provenance, methodological transparency, and licence compatibility.
We treat provenance as a first-class attribute. Every number on the platform can be traced to its origin.
Primary sources include:
- Official statistical agencies: Bureau of Labor Statistics, Office for National Statistics, Eurostat, Statistics Canada, and equivalents in 60+ countries
- International organisations: IMF, World Bank, OECD, BIS, UN Statistics Division
- Central banks: Federal Reserve, European Central Bank, Bank of England, and other major central banks
- Academic repositories: FRED (Federal Reserve Bank of St. Louis), CEPR, and peer-reviewed data publications
Collection Process
Automated Ingestion
Structured data from agency APIs, statistical portals, and machine-readable publications is collected via automated pipelines that run on defined schedules.
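The scheduling logic can be pictured as a registry of pipelines, each with its own cadence. This is a minimal sketch under assumed names (`Pipeline`, `due_pipelines`, and the example sources are illustrative, not Apiar Data's actual implementation):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Pipeline:
    source: str
    interval: timedelta   # how often the source publishes
    last_run: datetime    # last successful ingestion

def due_pipelines(pipelines: list[Pipeline], now: datetime) -> list[Pipeline]:
    """Return the pipelines whose next scheduled run has passed."""
    return [p for p in pipelines if now - p.last_run >= p.interval]

pipelines = [
    Pipeline("BLS CPI", timedelta(days=30), datetime(2024, 1, 1, tzinfo=timezone.utc)),
    Pipeline("ECB rates", timedelta(days=1), datetime(2024, 1, 30, tzinfo=timezone.utc)),
]
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
print([p.source for p in due_pipelines(pipelines, now)])  # ['BLS CPI', 'ECB rates']
```

A production scheduler would also handle retries and release-calendar offsets; the point here is only that each source carries its own cadence.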
Manual Curation
Data from sources that lack structured APIs — PDFs, web tables, irregular releases — is ingested through semi-automated processes with human verification.
All ingested data is checked against schema definitions before entering the platform. Validation failures trigger alerts for manual review; data that fails validation is held in quarantine and not published until the issue is resolved.
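The validate-then-quarantine flow can be sketched as follows. Field names (`value`, `date`) and the checks themselves are illustrative assumptions, not the platform's actual schema:

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors (empty list = record is valid)."""
    errors = []
    if not isinstance(record.get("value"), (int, float)):
        errors.append("value must be numeric")
    if "date" not in record:
        errors.append("missing date")
    return errors

def ingest(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into publishable data and quarantined failures."""
    published, quarantined = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            # held for manual review; never enters the published dataset
            quarantined.append({"record": record, "errors": errors})
        else:
            published.append(record)
    return published, quarantined
```

Quarantined records keep their error list attached, so a reviewer sees exactly which check failed.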
Data Normalisation
Raw data from different sources arrives in incompatible formats: different time granularities, units of measurement, calendar conventions, and geographic classifications. Normalisation is the process of resolving these incompatibilities.
Our normalisation pipeline applies the following transformations:
- Date standardisation: All time stamps are converted to ISO 8601 format with explicit UTC offset
- Unit harmonisation: Base units are preserved; derived units (index values, ratios) are clearly labelled
- Geographic coding: Geographic identifiers are mapped to ISO 3166 country codes and UN M49 region codes
- Classification alignment: Where applicable, industry and product classifications are mapped to standard schemas (ISIC, HS, CPA)
Normalisation is applied non-destructively. Original source values are always preserved alongside normalised values.
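The date and geographic transformations above, applied non-destructively, might look like this. The input format, field names, and the two-entry country table are assumptions for illustration (a real pipeline would carry the full ISO 3166 mapping):

```python
from datetime import datetime, timezone

# hypothetical lookup; the real mapping covers all ISO 3166 codes
COUNTRY_CODES = {"United Kingdom": "GB", "Germany": "DE"}

def normalise(record: dict) -> dict:
    """Attach normalised fields without overwriting the source values."""
    dt = datetime.strptime(record["date"], "%d/%m/%Y").replace(tzinfo=timezone.utc)
    return {
        **record,                                    # original values preserved
        "date_iso": dt.isoformat(),                  # ISO 8601 with explicit UTC offset
        "country_iso": COUNTRY_CODES[record["country"]],
    }
```

Because the normalised fields are added alongside the originals rather than replacing them, the source representation remains queryable after the transformation.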
Quality Assurance
We apply a multi-stage quality assurance process to all data before publication:
- Schema validation: Structural checks against expected data types, ranges, and formats
- Outlier detection: Statistical tests flag values that deviate significantly from historical series patterns
- Continuity checks: Sudden breaks in series values or coverage gaps trigger manual review
- Cross-source consistency: For series covered by multiple sources, automated checks compare values and flag significant divergence
- Manual review: Flagged series are reviewed by a team member before publication
Quality flags are displayed on dataset pages where relevant. Known issues, caveats, and data limitations are documented in the methodology notes for each dataset.
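Two of the automated stages above, outlier detection and continuity checking, can be sketched with simple statistics. The z-score threshold and the integer-period gap check are illustrative choices, not the platform's documented tests:

```python
from statistics import mean, stdev

def outlier_flags(series: list[float], z: float = 3.0) -> list[int]:
    """Indices whose z-score against the series exceeds the threshold."""
    m, s = mean(series), stdev(series)
    return [i for i, v in enumerate(series) if s and abs(v - m) / s > z]

def continuity_gaps(periods: list[int], expected_step: int = 1) -> list[tuple[int, int]]:
    """Pairs of consecutive periods whose spacing exceeds the expected step."""
    return [(a, b) for a, b in zip(periods, periods[1:]) if b - a > expected_step]
```

Anything either function flags would go to manual review rather than being dropped automatically, matching the process described above.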
Update Frequency
Update schedules are set based on source release cadence. Where source agencies publish advance release calendars, we configure ingestion pipelines to run shortly after the scheduled release time. For agencies without advance calendars, we monitor sources continuously.
The “last updated” timestamp on each dataset page reflects the most recent data point, not the last time we checked the source.
Handling Revisions
Statistical agencies regularly revise previously published data — for seasonal adjustment, methodological improvements, or corrected source data. Our approach to revisions is:
- Revisions are applied to the live series automatically when detected
- The revision date and nature of the change are logged in the series metadata
- For significant revisions, a note is added to the dataset page explaining the change
- Vintage series (point-in-time snapshots) are maintained for datasets where revision history matters for analysis
We do not backfill revisions silently. All changes to published data are logged and queryable via the API.
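The revision workflow above, update the live value, log the change, can be sketched like this. The series and log structures are hypothetical; only the behaviour (no silent backfill, every change recorded) comes from the text:

```python
from datetime import datetime, timezone

def apply_revision(series: dict, period: str, new_value: float, log: list) -> None:
    """Update the live series value and record the change in the revision log."""
    old_value = series["values"].get(period)
    if old_value == new_value:
        return  # nothing changed; nothing to log
    series["values"][period] = new_value
    log.append({
        "period": period,
        "old_value": old_value,
        "new_value": new_value,
        "revised_at": datetime.now(timezone.utc).isoformat(),
    })

series = {"id": "demo.cpi", "values": {"2024-01": 100.0}}
revision_log: list[dict] = []
apply_revision(series, "2024-01", 100.3, revision_log)
```

The log, not the series itself, is what an API revision query would read: the live value always reflects the latest revision, while the log preserves the history.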