Provenance of function executions
Paper accepted at SemSci @ ISWC2017

About

A large part of scientific output entails computational experiments, e.g., processing data to generate new data. However, this generation process is only documented in human-readable form or as a software repository. This inhibits reproducibility and comparability, as current documentation solutions do not provide detailed metadata and rely on the availability of specific software environments.

Here, we propose a mechanism that automatically captures interchangeable and implementation-independent metadata and provenance, including for data processing. When declarative mapping documents describe the computational experiment, term-level provenance can be captured automatically for both schema and data transformations, recording both the software tools used and the input-output pairs of the data processing executions.

More specifically, this page shows how this provenance is captured for mapping documents described using RML and FnO, as implemented in the RMLMapper.

Example

As an example, we generate Linked Data using an RML and FnO declarative mapping document. Resources of type ex:Person are created based on a data source (:Person_LogicalSource). Each resource gets assigned a name (:NameMapping), using the predicate dbo:name. To obtain the name value, we apply the function grel:toTitleCase to the "name" reference within our data source (:Person_LogicalSource), so that the names in our dataset are nicely formatted.

:Person_TemplateMapping
    rml:logicalSource :Person_LogicalSource ;
    rr:subjectMap :Person_SubjectMap ;
    rr:predicateObjectMap :NameMapping .
:Person_SubjectMap
    rr:template "http://example.org/{id}" ;
    rr:class ex:Person .
:NameMapping
    rr:predicate dbo:name ;
    rr:objectMap [
        a fnml:FunctionTermMap ;
        fnml:functionValue [
            rml:logicalSource :Person_LogicalSource ;
            rr:predicateObjectMap [
                rr:predicate fno:executes ;
                rr:objectMap [ rr:constant grel:toTitleCase ] ] ;
            rr:predicateObjectMap [
                rr:predicate grel:valueParameter ;
                rr:objectMap [ rml:reference "name" ] ]
        ]
    ] .
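The effect of this mapping on a single record can be sketched in Python. This is only an illustration, not the RMLMapper's actual code: the function names to_title_case and map_person, the record layout, and the use of plain tuples for triples are assumptions made for this sketch.

```python
# Hypothetical stand-in for grel:toTitleCase (the real implementation is external).
def to_title_case(value: str) -> str:
    # Capitalize the first letter of every space-separated word.
    return " ".join(word.capitalize() for word in value.split(" "))


def map_person(record: dict) -> list:
    """Apply the subject template and the :NameMapping to one source record."""
    subject = "http://example.org/{}".format(record["id"])  # rr:template
    return [
        (subject, "rdf:type", "ex:Person"),                   # rr:class
        (subject, "dbo:name", to_title_case(record["name"])),  # function term map
    ]


triples = map_person({"id": "1", "name": "ben de meester"})
for triple in triples:
    print(triple)
```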

To execute grel:toTitleCase, we dereference its IRI to retrieve the actual implementation (http://example.com/grelFunctions.jar).

More info on this dereferencing can be found in our accompanying paper.

grel:toTitleCase :implementedIn :grelJavaImpl .
:grelJavaImpl a prov:Agent;
    doap:file-release <http://example.com/grelFunctions.jar> .

<http://example.com/grelFunctions.jar> a spdx:File ;
    spdx:checksum [
        spdx:algorithm spdx:checksumAlgorithm_sha1 ;
        spdx:checksumValue "ffbdbf69f1572ea6a9f2da9a351a480f30070312"
    ] .
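The SHA-1 checksum in this description makes it possible to check that a retrieved file is the described release. A minimal sketch, assuming the file contents are already available as bytes; the helper names sha1_hex and matches_checksum are hypothetical, and downloading the file is left out:

```python
import hashlib

# Value taken from spdx:checksumValue in the description above.
EXPECTED_SHA1 = "ffbdbf69f1572ea6a9f2da9a351a480f30070312"


def sha1_hex(data: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of the given bytes."""
    return hashlib.sha1(data).hexdigest()


def matches_checksum(data: bytes, expected: str = EXPECTED_SHA1) -> bool:
    """Compare the digest of the retrieved bytes with the recorded checksum."""
    return sha1_hex(data) == expected.lower()
```

An implementation file whose digest does not match would not be the one the provenance refers to, so a caller could refuse to attribute executions to it.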

With the implementation retrieved, the mapping can generate the triple <http://example.org/1> dbo:name "Ben De Meester".

The generation provenance of the value "Ben De Meester" can be modeled as follows.

First, the FnO statements involved in the data processing can be automatically captured during the mapping process:

grel:toTitleCase a fno:Function, prov:Entity ;
    fno:name "title case" ;
    fno:expects ( [ fno:predicate grel:stringInput ] ) ;
    fno:output ( [ fno:predicate grel:stringOutput ] ) .

:exe a fno:Execution, prov:Activity ;
    :implementation :grelJavaImpl ;
    fno:executes grel:toTitleCase ;
    grel:stringInput :input ;
    grel:stringOutput :output .

:input a prov:Entity ; rdf:value "ben de meester" .
:output a prov:Entity ; rdf:value "Ben De Meester" .

These FnO statements allow us to derive the actual PROV-O information.

:NameMapping a prov:Activity .

:exe a prov:Activity ;
    prov:wasInformedBy :NameMapping ;
    prov:used :input ;
    prov:used grel:toTitleCase ;
    prov:wasAssociatedWith :grelJavaImpl ;
    prov:qualifiedAssociation [
        a prov:Association;
        prov:agent   :grelJavaImpl;
        prov:hadRole :implementation;
        prov:hadPlan grel:toTitleCase
    ] ;
    prov:startedAtTime "XXX"^^xsd:dateTime ;
    prov:endedAtTime "YYY"^^xsd:dateTime .

:output a prov:Entity ;
    prov:wasGeneratedBy :exe ;
    prov:wasAttributedTo :grelJavaImpl .
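The derivation from FnO statements to PROV-O statements is mechanical, which a short sketch can make concrete. It assumes the FnO execution is available as a dictionary of IRIs and emits triples as plain tuples; derive_prov and the dictionary keys are hypothetical names, and the qualified association and timestamps are omitted for brevity.

```python
def derive_prov(exe_iri, fno, mapping_iri):
    """Derive PROV-O triples from the FnO description of one execution.

    fno holds the IRIs of the executed function, the implementation,
    the input entity, and the output entity.
    """
    agent = fno["implementation"]
    return [
        (mapping_iri, "rdf:type", "prov:Activity"),
        (exe_iri, "rdf:type", "prov:Activity"),
        (exe_iri, "prov:wasInformedBy", mapping_iri),
        (exe_iri, "prov:used", fno["input"]),
        (exe_iri, "prov:used", fno["executes"]),
        (exe_iri, "prov:wasAssociatedWith", agent),
        (fno["output"], "rdf:type", "prov:Entity"),
        (fno["output"], "prov:wasGeneratedBy", exe_iri),
        (fno["output"], "prov:wasAttributedTo", agent),
    ]


prov_triples = derive_prov(
    ":exe",
    {
        "executes": "grel:toTitleCase",
        "implementation": ":grelJavaImpl",
        "input": ":input",
        "output": ":output",
    },
    ":NameMapping",
)
```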

As a result, we have:

  • The output entities, i.e., the resulting values, give a clear list of what was generated.
  • The data transformation activities, i.e., the FnO execution triples, show what happened independently of any implementation.
    This can be used to compare results with other implementations.
  • The tool agents, i.e., the actual implementation files (or remote APIs, or other ways of performing the actual function…), show exactly which implementation was responsible for the execution. This allows for attribution and accountability.

This has the added benefits that:

  • Easy queries (e.g., querying all fno:Execution resources that fno:executes grel:toTitleCase and that prov:wasInformedBy :NameMapping) can give an immediate ground truth.
  • To compare data transformation tools, the actual mapping no longer needs to be executed; only the :input values are needed.
  • The prov:startedAtTime and prov:endedAtTime allow for easy performance evaluation.
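These three uses can be sketched over captured execution records. The record layout, the helper names, and in particular the timestamps below are made up for illustration; in practice the same selection would typically be expressed as a query over the captured RDF.

```python
from datetime import datetime

# One captured execution; the timestamps are invented illustration values.
executions = [
    {
        "executes": "grel:toTitleCase",
        "informed_by": ":NameMapping",
        "input": "ben de meester",
        "output": "Ben De Meester",
        "started": datetime(2017, 1, 1, 12, 0, 0),
        "ended": datetime(2017, 1, 1, 12, 0, 0, 250000),
    },
]


def select(records, function_iri, mapping_iri):
    """Executions of one function that were informed by one mapping."""
    return [
        r for r in records
        if r["executes"] == function_iri and r["informed_by"] == mapping_iri
    ]


def mismatches(records, candidate):
    """Re-run a candidate tool on the recorded inputs and list differing outputs."""
    return [r for r in records if candidate(r["input"]) != r["output"]]


def duration_seconds(record):
    """Elapsed time between the recorded start and end of an execution."""
    return (record["ended"] - record["started"]).total_seconds()


ground_truth = select(executions, "grel:toTitleCase", ":NameMapping")
print(len(mismatches(ground_truth, str.title)))  # candidate agrees with the records
print(duration_seconds(ground_truth[0]))
```

Here str.title and str.upper would play the role of two alternative tools: the first reproduces the recorded output, the second does not, and neither comparison requires running the mapping again.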

More information can be found in the paper: Detailed Provenance Capture of Data Processing