A large part of scientific output entails computational experiments, e.g., processing data to generate new data. However, this generation process is only documented in human-readable form or as a software repository. This inhibits reproducibility and comparability, as current documentation solutions do not provide detailed metadata and rely on the availability of specific software environments.
Here, we propose an automatic capturing mechanism for interchangeable and implementation-independent metadata and provenance that includes data processing. By using declarative mapping documents to describe the computational experiment, term-level provenance can be captured automatically for both schema and data transformations, storing both the software tools used and the input-output pairs of the data processing executions.
More specifically, this page shows how this provenance is captured for mapping documents described using RML and FnO, as implemented in the RMLMapper.
As an example, we generate Linked Data using an RML and FnO declarative document.
Resources of type ex:Person are created based on a data source (:Person_LogicalSource).
Each resource gets assigned a name (:NameMapping), using the predicate dbo:name.
To get the name value, we use the function grel:toTitleCase, to have nicely formatted names in our dataset.
For this, we use the "name" reference within our data source (:Person_LogicalSource).
:Person_TemplateMapping
    rml:logicalSource :Person_LogicalSource ;
    rr:subjectMap :Person_SubjectMap ;
    rr:predicateObjectMap :NameMapping .

:Person_SubjectMap
    rr:template "http://example.org/{id}" ;
    rr:class ex:Person .

:NameMapping
    rr:predicate dbo:name ;
    rr:objectMap [
        a fnml:FunctionTermMap ;
        fnml:functionValue [
            rml:logicalSource :Person_LogicalSource ;
            rr:predicateObjectMap [
                rr:predicate fno:executes ;
                rr:objectMap [ rr:constant grel:toTitleCase ] ] ;
            rr:predicateObjectMap [
                rr:predicate grel:valueParameter ;
                rr:objectMap [ rml:reference "name" ] ]
        ]
    ] .
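As a rough illustration of what this mapping does for a single record, the following Python sketch expands the subject template and applies a title-case function to the "name" reference. The record values, the helper names (`expand_template`, `to_title_case`, `map_person`), and the use of Python's `str.title()` as a stand-in for grel:toTitleCase are assumptions made for this sketch; it does not reflect how the RMLMapper is implemented.

```python
import re

def expand_template(template: str, record: dict) -> str:
    """Replace each {key} placeholder in an rr:template with the record's value."""
    return re.sub(r"\{([^{}]+)\}", lambda m: str(record[m.group(1)]), template)

def to_title_case(value: str) -> str:
    """Stand-in for grel:toTitleCase (assumption: str.title() is close enough)."""
    return value.title()

def map_person(record: dict) -> tuple:
    """Mimic :Person_SubjectMap and :NameMapping for one record."""
    subject = expand_template("http://example.org/{id}", record)
    name = to_title_case(record["name"])  # corresponds to rml:reference "name"
    return (subject, "dbo:name", name)

# Hypothetical input record from :Person_LogicalSource
triple = map_person({"id": "1", "name": "ben de meester"})
print(triple)
```

The function call is the only step that transforms data rather than schema, which is why its input and output are worth recording separately.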
To retrieve the actual implementation of grel:toTitleCase, we dereference its IRI, which leads to the implementation (http://example.com/grelFunctions.jar).
More info on this dereferencing can be found in our accompanying paper.
grel:toTitleCase :implementedIn :grelJavaImpl .

:grelJavaImpl a prov:Agent ;
    doap:file-release <http://example.com/grelFunctions.jar> .

<http://example.com/grelFunctions.jar> a spdx:File ;
    spdx:checksum [
        spdx:algorithm spdx:checksumAlgorithm_sha1 ;
        spdx:checksumValue "ffbdbf69f1572ea6a9f2da9a351a480f30070312"
    ] .
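The SPDX checksum in this description makes it possible to check that a retrieved file is the release that was described. A minimal Python sketch of such a check follows; the function name `matches_sha1` and the way the file bytes are obtained are assumptions, and only the checksum value is taken from the description above.

```python
import hashlib

# spdx:checksumValue from the description of grelFunctions.jar
EXPECTED_SHA1 = "ffbdbf69f1572ea6a9f2da9a351a480f30070312"

def matches_sha1(data: bytes, expected_hex: str) -> bool:
    """Return True if the SHA-1 digest of data equals the expected hex string."""
    return hashlib.sha1(data).hexdigest() == expected_hex.lower()

# Usage (hypothetical local copy of the jar):
# with open("grelFunctions.jar", "rb") as f:
#     ok = matches_sha1(f.read(), EXPECTED_SHA1)
```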
This can thus trigger the generation of the triple <http://example.org/1> dbo:name "Ben De Meester".
The generation provenance of the value "Ben De Meester" can be modeled as follows:
The FnO statements involved in the data processing can be automatically captured during the mapping process.
grel:toTitleCase a fno:Function, prov:Entity ;
    fno:name "title case" ;
    fno:expects ( [ fno:predicate grel:stringInput ] ) ;
    fno:output ( [ fno:predicate grel:stringOutput ] ) .

:exe a fno:Execution, prov:Activity ;
    :implementation :grelJavaImpl ;
    fno:executes grel:toTitleCase ;
    grel:stringInput :input ;
    grel:stringOutput :output .

:input a prov:Entity ; rdf:value "ben de meester" .
:output a prov:Entity ; rdf:value "Ben De Meester" .
These FnO statements allow us to derive the actual PROV-O information.
:NameMapping a prov:Activity .

:exe a prov:Activity ;
    prov:wasInformedBy :NameMapping ;
    prov:used :input ;
    prov:used grel:toTitleCase ;
    prov:wasAssociatedWith :grelJavaImpl ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent :grelJavaImpl ;
        prov:hadRole :implementation ;
        prov:hadPlan grel:toTitleCase
    ] ;
    prov:startedAtTime "XXX"^^xsd:dateTime ;
    prov:endedAtTime "YYY"^^xsd:dateTime .

:output a prov:Entity ;
    prov:wasGeneratedBy :exe ;
    prov:wasAttributedTo :grelJavaImpl .
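The step from FnO statements to PROV-O statements is largely mechanical, which the following Python sketch tries to illustrate using plain tuples as triples. The dictionary layout of the execution record and the function name `derive_prov` are hypothetical. The qualified association and the timestamps are left out for brevity, since the example above only shows placeholder times.

```python
def derive_prov(execution: dict) -> set:
    """Derive a subset of PROV-O triples from a captured FnO execution (sketch)."""
    exe = execution["id"]
    triples = {
        (execution["mapping"], "rdf:type", "prov:Activity"),
        (exe, "rdf:type", "prov:Activity"),
        (exe, "prov:wasInformedBy", execution["mapping"]),
        (exe, "prov:used", execution["function"]),
        (exe, "prov:wasAssociatedWith", execution["implementation"]),
    }
    for entity in execution["inputs"]:
        triples.add((exe, "prov:used", entity))
    for entity in execution["outputs"]:
        triples.add((entity, "rdf:type", "prov:Entity"))
        triples.add((entity, "prov:wasGeneratedBy", exe))
        triples.add((entity, "prov:wasAttributedTo", execution["implementation"]))
    return triples

# The execution from the example, written as a plain record (hypothetical layout)
exe_record = {
    "id": ":exe",
    "mapping": ":NameMapping",
    "function": "grel:toTitleCase",
    "implementation": ":grelJavaImpl",
    "inputs": [":input"],
    "outputs": [":output"],
}

for triple in sorted(derive_prov(exe_record)):
    print(*triple, ".")
```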
As a result, we have:
This has the added benefits that:
fno:Execution that fno:executes grel:toTitleCase, that prov:wasInformedBy :NameMapping) can give an immediate ground truth
:input values are needed.
prov:startedAtTime and prov:endedAtTime allow for easy performance evaluation.
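As a sketch of how the captured provenance could be used, the following Python snippet filters execution records by function and mapping, collects their input-output pairs, and computes their duration from the start and end times. The record layout, the function names, the second mapping (:OtherMapping), and the timestamps are all made up for illustration; the example above only contains placeholder times.

```python
from datetime import datetime

def select_executions(executions, function, mapping):
    """Keep executions of a given function that were informed by a given mapping."""
    return [e for e in executions
            if e["executes"] == function and e["wasInformedBy"] == mapping]

def duration_seconds(execution) -> float:
    """Elapsed time between prov:startedAtTime and prov:endedAtTime."""
    start = datetime.fromisoformat(execution["startedAtTime"])
    end = datetime.fromisoformat(execution["endedAtTime"])
    return (end - start).total_seconds()

# Hypothetical captured executions (timestamps are invented)
executions = [
    {"executes": "grel:toTitleCase", "wasInformedBy": ":NameMapping",
     "input": "ben de meester", "output": "Ben De Meester",
     "startedAtTime": "2024-01-01T12:00:00", "endedAtTime": "2024-01-01T12:00:02"},
    {"executes": "grel:toTitleCase", "wasInformedBy": ":OtherMapping",
     "input": "x", "output": "X",
     "startedAtTime": "2024-01-01T12:00:02", "endedAtTime": "2024-01-01T12:00:03"},
]

selected = select_executions(executions, "grel:toTitleCase", ":NameMapping")
# Input-output pairs of the selected executions can act as ground truth
pairs = [(e["input"], e["output"]) for e in selected]
timings = [duration_seconds(e) for e in selected]
```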
More information can be found in the paper: Detailed Provenance Capture of Data Processing