Mapping ISO 19115 to Lineage Tracking: Architecture & Implementation
Mapping ISO 19115 to lineage tracking requires translating the standard's LI_Lineage metadata block into a directed acyclic graph (DAG) of data transformations, input sources, and responsible agents. The core mapping aligns LI_Source with input dataset nodes, LI_ProcessStep with transformation edges, and LI_Lineage/statement with high-level provenance context. Modern lineage systems consume this structure by extracting each process step's description, dateTime, processor, and source references, then normalizing them into machine-readable records compatible with OpenLineage or W3C PROV-O. For GIS data stewards and compliance officers, the objective is to convert narrative-heavy ISO 19115 records into auditable, queryable provenance trails that satisfy regulatory retention and reproducibility mandates.
Core Mapping Architecture
ISO 19115-1:2014 defines geospatial lineage as a hierarchical model, usually exchanged as XML, that prioritizes human readability over machine traversal. When designing a geospatial lineage architecture, engineering teams must account for the fact that ISO 19115 lineage often lacks strict referential integrity between processing steps. Mapping these records to a compliance framework therefore requires normalizing free-text descriptions into structured process types, standardizing temporal formats, and linking processor roles to organizational identity systems.
The practical element-to-lineage mapping follows this schema:
| ISO 19115 Element | Lineage Graph Equivalent | Data Type / Format |
|---|---|---|
| LI_Lineage/statement | Lineage context/summary | String (metadata-level annotation) |
| LI_Source/citation | Source dataset node | Object (URI, title, version, spatial extent) |
| LI_Source/scope | Input data boundary constraints | String/Geometry reference |
| LI_ProcessStep/description | Transformation edge label | String (normalized to controlled vocabulary) |
| LI_ProcessStep/dateTime | Execution timestamp | ISO 8601 UTC (YYYY-MM-DDTHH:MM:SSZ) |
| LI_ProcessStep/processor | Actor/agent node | Object (organization, role, contact) |
| LI_ProcessStep/source | Input node reference | URI/UUID (creates graph edges) |
| LI_ProcessStep/output | Output node reference | URI/UUID (completes the DAG) |
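As a rough illustration of the mapping table, the target graph model could be expressed with plain dataclasses. This is a minimal sketch, not a prescribed schema: the class and field names (SourceNode, AgentNode, ProcessStepEdge, LineageGraph) are hypothetical and chosen only to mirror the table rows.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SourceNode:
    """Vertex built from LI_Source (citation and scope)."""
    uri: str
    title: Optional[str] = None
    scope: Optional[str] = None


@dataclass(frozen=True)
class AgentNode:
    """Vertex built from LI_ProcessStep/processor."""
    organization: str
    role: Optional[str] = None


@dataclass
class ProcessStepEdge:
    """Transformation built from LI_ProcessStep."""
    label: str                                        # normalized description
    executed_at: Optional[str] = None                 # dateTime as ISO 8601 UTC
    agent: Optional[AgentNode] = None                 # processor
    inputs: List[str] = field(default_factory=list)   # source URIs
    output: Optional[str] = None                      # output URI


@dataclass
class LineageGraph:
    """Container for one LI_Lineage block."""
    statement: str = ""
    sources: Dict[str, SourceNode] = field(default_factory=dict)
    steps: List[ProcessStepEdge] = field(default_factory=list)

    def add_source(self, node: SourceNode) -> None:
        self.sources[node.uri] = node

    def add_step(self, step: ProcessStepEdge) -> None:
        # Reject steps that reference unknown inputs so edges never dangle.
        missing = [uri for uri in step.inputs if uri not in self.sources]
        if missing:
            raise KeyError(f"unknown source URIs: {missing}")
        self.steps.append(step)
```

Rejecting steps with unregistered inputs is one way to compensate for the weak referential integrity of the source records; a more lenient design might create placeholder nodes instead.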
Handling Namespace Variants & XML Parsing
ISO 19115 lineage metadata spans multiple standard revisions and XML encodings. Legacy implementations use the gmd namespace (ISO 19115:2003, encoded per ISO 19139), while modern deployments adopt mrl (ISO 19115-1:2014, encoded per ISO/TS 19115-3). A robust parser must query both paths and tolerate missing elements instead of raising AttributeError on a None result. The Python standard library's xml.etree.ElementTree module provides sufficient XPath-like navigation for this task, though production systems often migrate to lxml for schema validation and faster iteration over large metadata catalogs.
When extracting lineage, always:
- Register all relevant namespaces upfront to avoid silent misses.
- Use .findall() with fallback paths to capture both legacy and modern structures.
- Strip whitespace and validate timestamps before serialization.
- Decouple extraction from graph ingestion to enable idempotent retries.
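The last point, idempotent retries, can be sketched with a content-derived key: if the same extracted record is submitted twice, the second write becomes a no-op. The example below is an assumption-laden sketch; InMemoryLineageStore is a hypothetical stand-in for whatever graph or lineage store is actually in use, and record_key is an illustrative helper name.

```python
import hashlib
import json
from typing import Any, Dict


def record_key(record: Dict[str, Any]) -> str:
    """Derive a deterministic key from the record's canonical JSON form."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryLineageStore:
    """Hypothetical stand-in for a lineage store with upsert semantics."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, Any]] = {}

    def upsert(self, record: Dict[str, Any]) -> bool:
        """Insert the record once; return False if it was already present."""
        key = record_key(record)
        if key in self.records:
            return False
        self.records[key] = record
        return True
```

Because the key depends only on the sorted record content, a retry after a partial failure can resend every extracted record without creating duplicates.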
Python Extraction & Normalization
The following script parses ISO 19115 XML, handles namespace variations, and outputs a JSON structure ready for ingestion into lineage databases or graph stores.
```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

# Namespace map for the ISO 19139 (gmd/gco) and ISO 19115-3 (mrl/cit/mcc) encodings.
# Version suffixes differ between schema releases (for example mrl/1.0 vs. mrl/2.0),
# so match them to the records in your catalog.
NAMESPACES = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
    "gmx": "http://www.isotc211.org/2005/gmx",
    "xlink": "http://www.w3.org/1999/xlink",
    "mrl": "http://standards.iso.org/iso/19115/-3/mrl/1.0",
    "cit": "http://standards.iso.org/iso/19115/-3/cit/2.0",
    "mcc": "http://standards.iso.org/iso/19115/-3/mcc/1.0",
}


def _safe_text(element: Optional[ET.Element]) -> Optional[str]:
    """Return whitespace-normalized text from an element and its descendants.

    ISO 19139 wraps values in child elements such as gco:CharacterString,
    so reading element.text alone would return only indentation.
    """
    if element is None:
        return None
    text = " ".join("".join(element.itertext()).split())
    return text or None


def _first_text(parent: ET.Element, paths: List[str]) -> Optional[str]:
    """Try each path in order and return the first non-empty text value."""
    for path in paths:
        text = _safe_text(parent.find(path, NAMESPACES))
        if text:
            return text
    return None


def _normalize_timestamp(raw_dt: Optional[str]) -> Optional[str]:
    """Convert an ISO 19115 date or dateTime to ISO 8601 UTC (YYYY-MM-DDTHH:MM:SSZ)."""
    if not raw_dt:
        return None
    try:
        # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
        dt_obj = datetime.fromisoformat(raw_dt.replace("Z", "+00:00"))
    except ValueError:
        return raw_dt  # Keep the raw string if it cannot be parsed
    if dt_obj.tzinfo is None:
        dt_obj = dt_obj.replace(tzinfo=timezone.utc)  # Assume naive values are UTC
    return dt_obj.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def parse_iso19115_lineage(xml_path: str) -> Dict[str, Any]:
    """Parse ISO 19115 XML and extract lineage into a normalized structure."""
    root = ET.parse(xml_path).getroot()
    lineage: Dict[str, Any] = {"statement": "", "sources": [], "process_steps": []}

    # High-level lineage statement (modern path first, then legacy)
    lineage["statement"] = _first_text(
        root, [".//mrl:LI_Lineage/mrl:statement", ".//gmd:LI_Lineage/gmd:statement"]
    ) or ""

    # Sources: child lookups use the paths of both encodings
    for path in (".//mrl:LI_Source", ".//gmd:LI_Source"):
        for src_elem in root.findall(path, NAMESPACES):
            lineage["sources"].append({
                "title": _first_text(src_elem, [
                    ".//cit:CI_Citation/cit:title",
                    ".//gmd:CI_Citation/gmd:title",
                ]),
                "scope": _first_text(src_elem, [".//mrl:scope", ".//gmd:scope"]),
                "uri": _first_text(src_elem, [
                    ".//cit:identifier/mcc:MD_Identifier/mcc:code",
                    ".//gmd:identifier/gmd:MD_Identifier/gmd:code",
                ]),
            })

    # Process steps: restrict role and organization lookups to the processor element
    for path in (".//mrl:LI_ProcessStep", ".//gmd:LI_ProcessStep"):
        for step_elem in root.findall(path, NAMESPACES):
            raw_dt = _first_text(step_elem, ["mrl:stepDateTime", "gmd:dateTime"])
            lineage["process_steps"].append({
                "description": _first_text(step_elem, ["mrl:description", "gmd:description"]),
                "timestamp_utc": _normalize_timestamp(raw_dt),
                "processor_role": _first_text(step_elem, [
                    "mrl:processor//cit:role",
                    "gmd:processor//gmd:role",
                ]),
                "processor_org": _first_text(step_elem, [
                    "mrl:processor//cit:CI_Organisation/cit:name",
                    "gmd:processor//gmd:organisationName",
                ]),
            })

    return lineage


if __name__ == "__main__":
    # Example usage
    result = parse_iso19115_lineage("metadata.xml")
    print(json.dumps(result, indent=2))
```
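To connect the parsed output to the PROV-O compatibility mentioned in the introduction, the sketch below reshapes a lineage dictionary into records loosely modeled on the PROV-JSON layout (entity, activity, agent, and relation lists). It is illustrative and has not been validated against the PROV specifications. The function name to_prov_records, the identifier scheme, and the per-step "inputs" key are assumptions; the parser in this section does not emit step-to-source references.

```python
from typing import Any, Dict


def to_prov_records(lineage: Dict[str, Any], dataset_id: str) -> Dict[str, Any]:
    """Reshape a parsed lineage dict into PROV-style records (sketch only)."""
    doc: Dict[str, Any] = {
        "entity": {}, "activity": {}, "agent": {},
        "used": [], "wasAssociatedWith": [],
    }

    # LI_Source -> prov:Entity (sources without an identifier are skipped)
    for src in lineage.get("sources", []):
        if src.get("uri"):
            doc["entity"][src["uri"]] = {"dcterms:title": src.get("title")}

    # LI_ProcessStep -> prov:Activity, processor -> prov:Agent
    for index, step in enumerate(lineage.get("process_steps", []), start=1):
        activity_id = f"{dataset_id}#step-{index}"  # assumed identifier scheme
        doc["activity"][activity_id] = {
            "rdfs:label": step.get("description"),
            "prov:endTime": step.get("timestamp_utc"),
        }
        org = step.get("processor_org")
        if org:
            agent_id = "agent:" + "-".join(org.lower().split())
            doc["agent"][agent_id] = {"rdfs:label": org, "role": step.get("processor_role")}
            doc["wasAssociatedWith"].append(
                {"prov:activity": activity_id, "prov:agent": agent_id}
            )
        # "inputs" is a hypothetical key holding source URIs consumed by the step.
        for uri in step.get("inputs", []):
            doc["used"].append({"prov:activity": activity_id, "prov:entity": uri})

    return doc
```

Without explicit step-to-source references, only the entity, activity, and agent records can be produced reliably; the prov:used relations depend on enriching each step with its inputs first.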
DAG Construction & Compliance Integration
Once parsed, the JSON output must be transformed into a graph model. Each LI_Source becomes a vertex with properties like dataset_id, spatial_extent, and version. Each LI_ProcessStep becomes a directed edge connecting an input vertex to an output vertex, annotated with transformation_type, executed_at, and responsible_agent. This structure enables downstream queries such as:
- Trace upstream dependencies for a published raster layer
- Identify all datasets processed by a specific agency role in Q3
- Validate temporal ordering of transformation steps
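Two of these queries, upstream tracing and temporal validation, can be sketched over a plain list of step records. The record shape used here (inputs, output, executed_at) is an assumption for illustration, as are the function names; a graph database would normally answer the same questions with its own query language.

```python
from collections import defaultdict
from datetime import datetime
from typing import Any, Dict, List, Set, Tuple


def _parse_ts(value: str) -> datetime:
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def build_parent_index(steps: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each output dataset to the set of datasets it was derived from."""
    parents: Dict[str, Set[str]] = defaultdict(set)
    for step in steps:
        parents[step["output"]].update(step["inputs"])
    return parents


def trace_upstream(target: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect every direct and transitive ancestor of the target dataset."""
    seen: Set[str] = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def temporal_violations(steps: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Return (input, output) pairs where the input was produced after the step that used it."""
    produced_at = {
        s["output"]: _parse_ts(s["executed_at"]) for s in steps if s.get("executed_at")
    }
    violations = []
    for step in steps:
        ran_at = produced_at.get(step["output"])
        if ran_at is None:
            continue
        for inp in step["inputs"]:
            if inp in produced_at and produced_at[inp] > ran_at:
                violations.append((inp, step["output"]))
    return violations
```

The agency-role query from the list would follow the same pattern: filter steps by processor role and by an executed_at range, then collect the matching outputs.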
For compliance officers, the LI_Lineage/statement serves as the top-level provenance anchor. Regulatory frameworks often require explicit documentation of data origin, transformation logic, and custodian accountability. By mapping processor roles to enterprise IAM directories, organizations can generate audit trails that support the controls in frameworks such as NIST SP 800-53 or ISO/IEC 27001. Additionally, enforcing controlled vocabularies for description fields prevents semantic drift across geospatial workflows, so lineage remains machine-auditable over multi-year retention periods.
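A controlled vocabulary for description fields can start as a small keyword table. The sketch below is illustrative only: the process types and patterns in PROCESS_VOCABULARY are hypothetical, and a real deployment would likely source them from an agreed organizational or community vocabulary.

```python
import re
from typing import Optional

# Hypothetical vocabulary: (keyword pattern, process type). Order matters,
# because the first matching pattern wins.
PROCESS_VOCABULARY = [
    (re.compile(r"\breproject|\bcoordinate transform", re.IGNORECASE), "reprojection"),
    (re.compile(r"\bclip|\bsubset|\bcrop", re.IGNORECASE), "clip"),
    (re.compile(r"\bresampl|\binterpolat", re.IGNORECASE), "resampling"),
    (re.compile(r"\bmosaic|\bmerge", re.IGNORECASE), "mosaic"),
    (re.compile(r"\bdigiti[sz]", re.IGNORECASE), "digitization"),
]
UNCLASSIFIED = "unclassified"


def classify_description(description: Optional[str]) -> str:
    """Map a free-text process step description to a controlled process type."""
    if not description:
        return UNCLASSIFIED
    for pattern, process_type in PROCESS_VOCABULARY:
        if pattern.search(description):
            return process_type
    return UNCLASSIFIED
```

Returning an explicit unclassified value, instead of guessing, keeps unmatched descriptions visible so that stewards can review them and extend the vocabulary.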
To maintain referential integrity, generate stable URIs for each LI_Source using persistent identifiers (e.g., DOIs or internal asset registry keys). Avoid relying on file paths or temporary database IDs, which break graph edges during system migrations. Finally, integrate automated schema validation into CI/CD pipelines to catch malformed XML or missing mandatory fields before they enter production lineage stores.
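Both recommendations can be approximated in a few lines. The sketch below derives a deterministic URN from a registry key with uuid.uuid5 and reports missing fields in a parsed lineage record. The registry namespace URL, the function names, and the choice of which fields to treat as required are assumptions; this check complements XML schema validation and does not replace it.

```python
import uuid
from typing import Any, Dict, List

# Hypothetical namespace: derive it once from a URL your organization controls.
REGISTRY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/asset-registry")


def stable_source_uri(registry_key: str) -> str:
    """Build a deterministic URN from a persistent registry key."""
    normalized = registry_key.strip().lower()
    if not normalized:
        raise ValueError("registry_key must not be empty")
    return f"urn:uuid:{uuid.uuid5(REGISTRY_NAMESPACE, normalized)}"


def find_missing_fields(lineage: Dict[str, Any]) -> List[str]:
    """List the fields that a local policy (assumed here) treats as required."""
    problems: List[str] = []
    for index, source in enumerate(lineage.get("sources", [])):
        if not source.get("uri"):
            problems.append(f"sources[{index}].uri")
    for index, step in enumerate(lineage.get("process_steps", [])):
        for name in ("description", "timestamp_utc"):
            if not step.get(name):
                problems.append(f"process_steps[{index}].{name}")
    return problems
```

In a CI/CD job, a non-empty result from find_missing_fields could fail the build, so incomplete records never reach the production lineage store.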
Conclusion
Mapping ISO 19115 to Lineage Tracking bridges legacy geospatial metadata standards with modern data observability practices. By parsing LI_Lineage blocks into normalized DAGs, GIS teams unlock queryable provenance, automated compliance reporting, and reproducible spatial analytics pipelines. The combination of namespace-aware parsing, strict timestamp normalization, and graph-based ingestion transforms narrative metadata into an enterprise-grade audit asset.