DataCommons Architecture

The fundamental layers of the DataCommons are illustrated here:

[Diagram: DataCommons]

Blockchain: At the lowest level is a blockchain based on Hyperledger Fabric that contains a canonical representation of the DataCommons’ contents. Only data custodians can write to it, but anyone inside or outside of OneCommons can read and duplicate it. This allows cheap, decentralized authentication of content and verification of transactions.

Data Stores: All data in the DataCommons is stored in one of the following databases: Identity, Archive, or the CommitLog.

API layer: Applications running on the platform access the DataCommons through database and API services operated by the data custodians. API gateways are responsible for enforcing and managing ownership, usage rights, and provenance.
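The content authentication attributed to the blockchain layer can be illustrated with a simplified hash chain. This is only a sketch of the general idea: it is not Hyperledger Fabric's API or block format, and the helper names (`append_block`, `verify_chain`, `is_authentic`) and the block layout are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List

GENESIS = "0" * 64  # assumed placeholder for the hash before the first block


def digest(content: bytes) -> str:
    """SHA-256 digest of a piece of content."""
    return hashlib.sha256(content).hexdigest()


def block_hash(prev_hash: str, content_digest: str) -> str:
    """Hash that links a block to its predecessor."""
    return hashlib.sha256((prev_hash + content_digest).encode("utf-8")).hexdigest()


def append_block(chain: List[Dict[str, str]], content: bytes) -> None:
    """Writer side (a data custodian in the text): record a content digest."""
    prev = chain[-1]["hash"] if chain else GENESIS
    d = digest(content)
    chain.append({"prev": prev, "content_digest": d, "hash": block_hash(prev, d)})


def verify_chain(chain: List[Dict[str, str]]) -> bool:
    """Reader side: recompute every link on a local copy of the chain."""
    prev = GENESIS
    for block in chain:
        expected = block_hash(prev, block["content_digest"])
        if block["prev"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True


def is_authentic(chain: List[Dict[str, str]], content: bytes) -> bool:
    """Content is authentic if the chain is intact and records its digest."""
    wanted = digest(content)
    return verify_chain(chain) and any(b["content_digest"] == wanted for b in chain)
```

Because verification only needs a copy of the chain and a hash function, any reader can run it locally without contacting a custodian.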

Data Stores

Identity

The Identity database provides global identity services for the platform, including user metadata, keys, and permissions.

CommitLog

Any data an application publishes for external use is committed to the CommitLog’s history. Each commit includes provenance metadata identifying the application build and the user that created it. Personal and private data is encrypted as described above, and public data includes licensing metadata.
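As a rough illustration of what such a commit might carry, the following sketch models a commit record with provenance and licensing metadata. The class, its field names (`app_build_id`, `user_id`, `license_id`, `encrypted`), and the validation rules are assumptions for illustration, not the CommitLog's actual format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """Hypothetical CommitLog entry; all field names are assumptions."""
    app_build_id: str                  # application build that produced the data
    user_id: str                       # user that created the data
    payload: dict                      # published data (ciphertext fields if private)
    license_id: Optional[str] = None   # licensing metadata for public data
    encrypted: bool = False            # True for personal or private data

    def commit_id(self) -> str:
        # Content-derived identifier: SHA-256 over a canonical JSON form.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate(commit: Commit) -> None:
    """Reject commits that lack the metadata the text calls for."""
    if not commit.app_build_id or not commit.user_id:
        raise ValueError("provenance metadata is required")
    if not commit.encrypted and not commit.license_id:
        raise ValueError("public data must include licensing metadata")
```

Deriving the identifier from the content is one possible design choice: two commits with identical data and provenance get the same id, which makes duplicates easy to detect.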

Any application writing to the DataCommons public APIs must provide a schema that describes which data should appear in the CommitLog. The DataCommons services that implement these APIs use this schema to update the CommitLog in parallel with their native updates.

The schemas may contain:

Applications can also subscribe to changes in the CommitLog and apply those changes to their native stores. This is equivalent to forking or branching in a version control system: the CommitLog tracks the resulting history and can apply history-based merge and syncing strategies when changes diverge.

This feature of the CommitLog enables the forking and remixing of live, running applications – a unique and tremendously powerful ability of the OneCommons platform.
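The subscribe-and-apply pattern described above can be sketched with a toy in-memory log. `InMemoryCommitLog` and `NativeStore` are hypothetical stand-ins, the change shape (`key`, `value`) is an assumption, and the merge strategies for diverging histories are deliberately left out.

```python
from typing import Callable, Dict, List

Change = Dict[str, object]  # assumed shape: {"key": ..., "value": ...}


class InMemoryCommitLog:
    """Toy stand-in for the CommitLog: an append-only list with subscribers."""

    def __init__(self) -> None:
        self.history: List[Change] = []
        self.subscribers: List[Callable[[int, Change], None]] = []

    def subscribe(self, callback: Callable[[int, Change], None], from_offset: int = 0) -> None:
        # Replay existing history first, so a new fork starts from the full state.
        for offset in range(from_offset, len(self.history)):
            callback(offset, self.history[offset])
        self.subscribers.append(callback)

    def commit(self, change: Change) -> int:
        self.history.append(change)
        offset = len(self.history) - 1
        for callback in self.subscribers:
            callback(offset, change)
        return offset


class NativeStore:
    """An application's own store, kept in sync by applying log changes."""

    def __init__(self) -> None:
        self.rows: Dict[object, object] = {}
        self.last_offset = -1

    def apply(self, offset: int, change: Change) -> None:
        if offset <= self.last_offset:
            return  # already applied; makes redelivery harmless
        self.rows[change["key"]] = change["value"]
        self.last_offset = offset


log = InMemoryCommitLog()
log.commit({"key": "a", "value": 1})   # committed before the fork exists
fork = NativeStore()
log.subscribe(fork.apply)              # fork replays history, then follows live
log.commit({"key": "b", "value": 2})
print(fork.rows)  # {'a': 1, 'b': 2}
```

Tracking the last applied offset is what lets a fork resume from where it stopped instead of replaying the whole history.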

Archive

The Archive is used as permanent storage by the persistence services that the API provides. It is partitioned by application, and each partition is encrypted with the application’s key. In effect, it provides backup and restore services for apps, which are essential for a fully reproducible system and for migrating apps between cloud providers.

API

Applications and services access the DataCommons through services that provide familiar APIs for persistence. These include:

File

Provided for applications that need a file object API (AWS S3-like) or a block device.

PubSub

The publish and subscribe service provides direct access to the DataCommons’ data stores using Apache Kafka.

Database services

Most applications and services will access the DataCommons through familiar SQL and NoSQL database services.

This diagram illustrates how applications built with traditional databases (initially MySQL, Postgres, and MongoDB) can utilize the DataCommons.

[Diagram: DataCommons]

The main requirement is for a database driver that has access to the front-end request or session id provided by the OneCommons service mesh and can log it alongside the transaction id.

With that, an asynchronous process can deduce provenance metadata from the session id and merge it with the transaction by reading the native transaction log that the database uses for replication (using a tool like Debezium).

That process would then use the schema associated with the database and application build-id to transform the updates in the transaction into a JSON representation that can be committed into the CommitLog.
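The pipeline described in the last three paragraphs can be sketched as a single function. Everything here is an assumption made for illustration: the change-event shape is a simplified stand-in and not Debezium's actual event envelope, and `SCHEMA`, `SESSIONS`, `TX_TO_SESSION`, and `to_commits` are hypothetical names.

```python
import json
from typing import Dict, Iterable, List

# Hypothetical schema: per table, the columns that may be published.
SCHEMA: Dict[str, List[str]] = {"posts": ["id", "title"]}

# Hypothetical session records, as the service mesh might expose them.
SESSIONS: Dict[str, Dict[str, str]] = {
    "sess-42": {"user_id": "user-1", "app_build_id": "build-7"},
}

# Hypothetical driver log: transaction id -> session id.
TX_TO_SESSION: Dict[int, str] = {1001: "sess-42"}


def to_commits(
    change_events: Iterable[dict],
    schema: Dict[str, List[str]],
    sessions: Dict[str, Dict[str, str]],
    tx_to_session: Dict[int, str],
) -> List[str]:
    """Turn replication-log events into JSON commits with provenance."""
    commits: List[str] = []
    for event in change_events:
        columns = schema.get(event["table"])
        if columns is None:
            continue  # the schema does not publish this table
        provenance = sessions.get(tx_to_session.get(event["tx_id"], ""))
        if provenance is None:
            continue  # no provenance available; skip rather than guess
        # Keep only the columns the schema allows into the CommitLog.
        data = {c: event["after"][c] for c in columns if c in event["after"]}
        commits.append(json.dumps(
            {"table": event["table"], "op": event["op"],
             "data": data, "provenance": provenance},
            sort_keys=True,
        ))
    return commits
```

A caller would feed this function the events read from the replication log and append the returned JSON strings to the CommitLog. Skipping events without provenance is one possible policy; a real process might instead hold them back and retry.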