Data Quality Methodology

Data Quality Methodology#

This page is where Douro Data differs most from the rest of the market, so it gets real depth.

The pipeline#

Every hospital in the catalog goes through the same four stages:

Fetch — we harvest the hospital's machine-readable price file from its published location.
Parse — the file is parsed faithfully. The parser transcribes what the hospital published; it does not infer, repair, or normalize values.
Quality gate — records must clear a strict quality gate before publication, and detectable constant-value patterns are annotated at the row level.
Publish — gated, annotated data lands in the Marketplace tables, stamped with LATEST_INGEST_DATE (our load date).

The never-suppress principle#

We never alter or delete a published value. If a hospital's file contains a $0.01 flood or a repeated magic number, those rows ship — flagged. This matters because deletion destroys evidence: you can't audit what isn't there, and a dataset that silently drops inconvenient rows can't tell you what its publisher decided you shouldn't see. We annotate, you choose.

AUDIT_FLAG: two grades, precisely#

AUDIT_FLAG is a nullable column on NEGOTIATED_RATES with exactly two values. The two grades are different kinds of statements, and the difference is the point:

CONSTANT_VALUE_FILL — an evidence-backed verdict. The value is a source-side fill, not a real per-service price: the hospital's file repeats a constant where distinct per-service prices should be. A small single-digit share of New York rows carries this flag; the Marketplace listing carries the current, date-stamped figure.

CONSTANT_RATE_PLATEAU — a neutral structural observation. One constant value appears across many codes of a payer column. That structure is consistent with real contract arrangements — blanket rates, case rates, tiered exports — so these rows are real data: annotated, never excluded. A larger share of rows carries this flag (the current figure is in the Marketplace listing). It is a structural fact you should know before per-code benchmarking, not a defect.

NULL — no pattern detected. This is not a certification. It means our detector found no constant-value pattern on that row, nothing more.

The two filter recipes#

Both recipes are documented because the right one depends on your question. The buyer chooses; we don't choose for you.

Strict per-code benchmarking — excludes both verdict and observation rows. The right default for per-code medians and percentiles, because a plateau column is per-code degenerate even when its constant is a real price:

WHERE AUDIT_FLAG IS NULL

Fills-only exclusion — excludes only verdict-grade fills and keeps plateau rows. The right choice when measuring a payer's overall posture or episode/blanket arrangements, where a real global case rate is signal, not noise:

WHERE AUDIT_FLAG IS DISTINCT FROM 'CONSTANT_VALUE_FILL'

Evidence on request#

Behind every flag is a row-level evidence payload — the detected pattern, its extent, and its verification status. That detail lives off-share and is available on request: steve@dourodata.com.