DataCommons Architecture

The fundamental layers of the DataCommons are illustrated here:

[Figure: DataCommons]

Blockchain: At the lowest level is a blockchain based on Hyperledger Fabric that contains a canonical representation of the DataCommons’ contents. Only data custodians can write to it, but anyone inside or outside of OneCommons can read and duplicate it. This allows cheap, decentralized authentication of content and verification of transactions.

Data Stores: All data in the DataCommons is stored in one of three databases: Identity, Archive, or the CommitLog.

API layer: Applications running on the platform access the DataCommons through these database and API services, which are operated by the data custodians. API gateways are responsible for enforcing and managing ownership, usage rights, and provenance.
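
The content authentication that the blockchain layer allows can be pictured as a digest comparison against a duplicated copy of the ledger. The following is a minimal sketch, not the platform's actual protocol: it assumes the ledger records a SHA-256 digest per content item, and `ledger`, `digest`, and `verify` are hypothetical names introduced only for illustration.

```python
import hashlib

def digest(content: bytes) -> str:
    """Return the hex SHA-256 digest of a piece of content."""
    return hashlib.sha256(content).hexdigest()

# Hypothetical stand-in for a reader's local copy of the ledger:
# content id -> digest recorded by a data custodian.
ledger = {"doc-1": digest(b"hello commons")}

def verify(content_id: str, content: bytes, ledger: dict) -> bool:
    """Check content against its recorded digest without contacting the custodian."""
    recorded = ledger.get(content_id)
    return recorded is not None and recorded == digest(content)
```

Because anyone can duplicate the ledger, a check like `verify("doc-1", data, ledger)` could run entirely on the reader's side, which is what makes the authentication cheap and decentralized.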

Data Stores

Identity

The Identity database provides global identity services for the platform, including user metadata, keys, and permissions.

Usage and data authorship are anonymized throughout the platform; only the data custodian knows the mapping between these anonymous identifiers and user accounts.
Applications receive identifying information about these anonymous ids only on a need-to-know basis, and only if the user grants permission.
Applications are required to write any data they convey or create on behalf of their users to the CommitLog. That data can be encrypted so that only the user and permissioned apps can read it, not the data custodian managing the CommitLog.
Applications don’t perform authentication themselves or have access to sensitive credentials like passwords; they rely on the OneCommons router for that.
In fact, authentication can happen on the client (e.g. in the browser), so the user’s credentials need not be known or stored by the platform at all (just don’t lose them!).
Pseudonymity: users can create as many accounts as they like and there is no system-wide requirement to collect personally identifying information about the user.
Decentralized: thanks to the blockchain, users can use their OneCommons identity and credentials for authentication outside of the platform without its participation or knowledge.
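
The anonymization and permission rules above can be sketched as follows. This is an illustration under stated assumptions rather than the Identity database's real design: the `Custodian` class, its method names, and the choice of an HMAC to derive anonymous ids are all hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Optional

class Custodian:
    """Hypothetical data custodian: the only party that can map anonymous ids to accounts."""

    def __init__(self) -> None:
        self._secret = secrets.token_bytes(32)  # never leaves the custodian
        self._accounts = {}                     # anonymous id -> account name
        self._grants = set()                    # (anonymous id, app) pairs

    def anonymous_id(self, account: str) -> str:
        """Derive a stable anonymous id that does not reveal the account name."""
        anon = hmac.new(self._secret, account.encode(), hashlib.sha256).hexdigest()
        self._accounts[anon] = account
        return anon

    def grant(self, anon_id: str, app: str) -> None:
        """Record that the user allows `app` to learn who is behind `anon_id`."""
        self._grants.add((anon_id, app))

    def reveal(self, anon_id: str, app: str) -> Optional[str]:
        """Return the account only if the user granted this app permission."""
        if (anon_id, app) in self._grants:
            return self._accounts.get(anon_id)
        return None
```

An application that has only the anonymous id gets `None` from `reveal` until the user calls `grant` for that specific application.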

CommitLog

Any data an application publishes for external use is committed in the CommitLog’s history. Each commit includes provenance metadata identifying the application build and the user that created it. Personal and private data is encrypted as described above and public data includes licensing metadata.
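
As a rough picture of what a commit might carry, the sketch below models a commit as an immutable record chained to its parent. The field names, the build-id format, and the use of a SHA-256 content address are assumptions made for illustration; the text above does not specify the CommitLog's actual layout.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class Commit:
    parent: Optional[str]      # id of the previous commit, None for the first
    app_build: str             # provenance: application build that wrote the data
    author: str                # provenance: anonymous id of the user
    license_id: Optional[str]  # licensing metadata, set for public data
    encrypted: bool            # True when payload is ciphertext (private data)
    payload: str               # JSON text, or encoded ciphertext

    def commit_id(self) -> str:
        """Content-address the commit so history can be chained and checked."""
        canonical = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()

# Two public commits by the same (hypothetical) user and application build.
first = Commit(None, "notes-app@1.4.2", "anon-7f3a", "CC-BY-4.0", False, '{"title": "Hi"}')
second = Commit(first.commit_id(), "notes-app@1.4.2", "anon-7f3a", "CC-BY-4.0", False,
                '{"title": "Hello"}')
```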

Any application writing to the DataCommons public APIs has to provide a schema that describes which data should appear in the CommitLog. The DataCommons services that implement these APIs use this schema to update the CommitLog in parallel with their native updates.

The schemas may contain:

privacy and ownership annotations
configuration annotations
backward and forward schema migration
merge strategies
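
To make the list above more concrete, here is one hypothetical shape such a schema could take, written as a Python structure. Every key, annotation name, and strategy name is invented for illustration, and configuration annotations are left out because the text does not describe them.

```python
def v1_to_v2(record: dict) -> dict:
    """Forward migration: version 1 called the body field `text`."""
    out = dict(record)
    out["body"] = out.pop("text")
    return out

def v2_to_v1(record: dict) -> dict:
    """Backward migration for consumers still on version 1."""
    out = dict(record)
    out["text"] = out.pop("body")
    return out

# Hypothetical schema for a "note" record.
NOTE_SCHEMA = {
    "version": 2,
    "fields": {
        "title": {"commit": True, "privacy": "public"},
        "body": {"commit": True, "privacy": "private"},  # would be encrypted
        "draft_cursor": {"commit": False},               # stays in the native store
    },
    "merge": {"title": "last-writer-wins", "body": "three-way"},
    "migrations": {(1, 2): v1_to_v2, (2, 1): v2_to_v1},
}

def committed_view(record: dict, schema: dict) -> dict:
    """Keep only the fields the schema marks for the CommitLog."""
    fields = schema["fields"]
    return {k: v for k, v in record.items() if fields.get(k, {}).get("commit", False)}
```

A service implementing the public API could call something like `committed_view` on each native update to decide what to mirror into the CommitLog.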

Applications can also subscribe to changes in the CommitLog and apply those changes to their native stores. This is equivalent to forking or branching in a version control system: the CommitLog tracks history and can apply history-based merge and sync strategies when changes diverge.
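
One way to picture a history-based merge is a field-level three-way merge against the common ancestor, as version control systems do. The sketch below is an assumption about what such a strategy could look like, not the CommitLog's actual algorithm; `three_way_merge` is a hypothetical helper, and conflicts are resolved in favor of the upstream side purely for simplicity.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two divergent versions of a record against their common ancestor.

    Returns (merged, conflicts). A key is a conflict when both sides changed it
    to different values; the upstream value (`theirs`) wins in that case.
    A missing key and a None value are treated the same in this sketch.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:        # both sides agree, or neither changed
            value = o
        elif o == b:      # only upstream changed
            value = t
        elif t == b:      # only the fork changed
            value = o
        else:             # both changed, differently
            conflicts.append(key)
            value = t
        if value is not None:
            merged[key] = value
    return merged, conflicts

base = {"title": "Hi", "body": "a", "tag": "x"}
ours = {"title": "Hello", "body": "a", "tag": "y"}    # the fork's version
theirs = {"title": "Hi", "body": "b", "tag": "z"}     # the upstream version
merged, conflicts = three_way_merge(base, ours, theirs)
# merged == {"title": "Hello", "body": "b", "tag": "z"}; conflicts == ["tag"]
```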

This feature of the CommitLog enables the forking and remixing of live, running applications – a unique and tremendously powerful ability of the OneCommons platform.

API

Applications and services access the DataCommons through services that provide familiar APIs for persistence. These include:

File

Provided for applications that need a file object API (AWS S3-like) or a block device.

PubSub

The publish and subscribe service provides direct access to the DataCommons’ data stores using Apache Kafka.

Database services

Most applications and services will access the DataCommons through familiar SQL and NoSQL database services.

This diagram illustrates how applications built with traditional databases (initially MySQL, PostgreSQL, and MongoDB) can use the DataCommons.

[Figure: DataCommons]

The main requirement is for a database driver that has access to the front-end request or session id provided by the OneCommons service mesh and can log it alongside the transaction id.

With that, an asynchronous process can deduce provenance metadata from the session id and merge it with the transaction by reading the native transaction log that the database uses for replication (using a tool like Debezium).

That process would then use the schema associated with the database and application build-id to transform the updates in the transaction into a JSON representation that can be committed into the CommitLog.
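
The pipeline described above can be sketched as a single transformation step. The event below follows the general shape of a Debezium change event (`op`, `before`, `after`, `source`), heavily trimmed; the two lookup tables, the `commit_fields` schema key, and the `to_commit` function are hypothetical stand-ins for the driver's session log, the service mesh, and the application schema.

```python
import json

# Hypothetical log written by the database driver: transaction id -> session id.
SESSION_BY_TX = {556: "sess-91c2"}
# Hypothetical provenance lookup, as the service mesh might expose it.
PROVENANCE_BY_SESSION = {
    "sess-91c2": {"user": "anon-7f3a", "app_build": "notes-app@1.4.2"},
}

def to_commit(event: dict, schema: dict) -> str:
    """Turn one change event into a JSON entry suitable for the CommitLog."""
    source = event["source"]
    session = SESSION_BY_TX[source["txId"]]
    row = event["after"] or {}       # "after" is empty for deletes
    allowed = schema["commit_fields"]
    commit = {
        "table": source["table"],
        "op": event["op"],           # "c" create, "u" update, "d" delete
        "data": {k: v for k, v in row.items() if k in allowed},
        "provenance": PROVENANCE_BY_SESSION[session],
    }
    return json.dumps(commit, sort_keys=True)

event = {
    "op": "u",
    "before": {"id": 1, "title": "Hi", "cursor": 3},
    "after": {"id": 1, "title": "Hello", "cursor": 9},
    "source": {"table": "notes", "txId": 556},
}
entry = to_commit(event, {"commit_fields": {"id", "title"}})
```

Because the step only reads the replication log and the session log, it can run asynchronously without changing how the application talks to its database.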