Archive Transparency - Resilient Internet Preservation

Trust Tomorrow

Everything digital is a copy of a copy.

Tomorrow's history is built with today's online artifacts, often stored on ephemeral storage media, without guarantees against backdated metadata or unobserved modifications.

Today's capabilities to generate large amounts of convincing fakes are improving fast, making it impossible to distinguish historical documents from forged artifacts.

Bad actors will try to leverage this for their own profit, putting pressure on current digital preservation efforts, sowing irreparable doubt into the genuineness of the recorded past.

Now is our last chance to build foundations for tomorrow's trust.

Record Today

Current digital archives are very successful at recording our present at scale. These document repositories are part of the backbone of the productive economy as anchors of trust online.

Generative technologies will soon be employed to wage economic war. Perfect operational security is extremely difficult and archives cannot afford to become a target.

Modern public key infrastructures and software supply chains have already solved similar problems, ensuring that billions of artifacts are tamper-evident every day.

We aim to integrate these technologies into current archival practices. We cannot store every document in the world, but we can distribute tamper-evident logs that record their existence.

We build software to support existing digital archives.

Prove Past

We are building a proof of past provenance.

We design for the century scale and we use plain text for storage.

We use simple cryptographic primitives so that we can convincingly claim that future users will be able to check that our index is genuine. We aim for threat resilience and cryptographic agility.

We believe in a trust model with public entities signing all online digital artifacts. Tamper-evident transparency logs have proven at scale that such a trust ecosystem can be built.

We build for local-first offline use as this is the best way to get these document indexes in lots of different places. We optimize for file storage instead of databases.

There is a long way to go. We are moving forward right now.

Learn more

Is this about trust online?

Yes, but not only.

We are used to relying on reputable sources online. However, for many legitimate reasons, these may delete, modify or move the various documents they are vouching for. Long term, we end up relying on third-party repositories of unknown trust, redistributing information.

We trust documents we find online simply because they look authentic or because we found them on a website that seems legitimate. We are unable to distinguish genuine documents stored by an honest third party from convincing fakes with backdated metadata.

This is a weakness that, sooner or later, will be exploited.

Why such a sense of urgency?

Because we are all late to the party.

The need for authentication in the archival landscape has long been known. The folks at LOCKSS drafted their threat model two decades ago, the LTANS working group concluded fifteen years ago, and Webrecorder integrated trusted timestamping several years ago.

However, most data online today is still not obviously authenticated. Digital archiving at scale is hard, budgets are small, and issues like copyright limit content-oriented approaches. Today, bad actors have all the capabilities they need to exploit these weaknesses.

We aim for a metadata-oriented approach that integrates on top of existing document repositories. We want anyone to be able to build a tamper-evident index of their digital artifacts today, cheaply, and in a way fit for long-term storage and large-scale distribution.

What kind of technologies?

We believe that the ideas underlying Certificate Transparency, Sigsum and Sigstore have a role to play in building tamper-evident historical records. They track billions of records in transparency logs every year, a scale comparable to existing digital archives.

Web archiving today is built with WARC files together with CDX index files matching metadata with SHA-1 or MD5 checksums. There exist many more standards for digital archives, each pairing items to their metadata. We do not propose replacing these standards.

We propose to append supplemental index files, text-based, capturing just enough metadata together with cryptographic hashes, signatures and trusted timestamps. We want to empower anyone to build and share these indexes, which are structured as transparency logs.
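As a rough illustration of what one line of such a text-based index could look like, here is a minimal sketch. The field order, the `sha256:` prefix and the function name `index_line` are assumptions made for this example and not the project's actual format. Signatures and trusted timestamps are left out because they need more than the standard library.

```python
import hashlib


def index_line(url: str, captured_at: str, content: bytes) -> str:
    """Format one hypothetical index entry as a single line of plain text.

    The layout (timestamp, digest, size, URL) is an assumption for
    illustration only; it is not the project's real file format.
    """
    digest = hashlib.sha256(content).hexdigest()
    return f"{captured_at} sha256:{digest} {len(content)} {url}"


# Example: record the existence of a document without storing the document.
line = index_line(
    "https://example.org/report.pdf",
    "2024-01-01T00:00:00Z",
    b"hello",
)
print(line)
```

The entry records only metadata and a digest, so the index can be shared even when the document itself cannot be redistributed.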

We are deliberately keeping things simple first, and adding value later.

What kind of ecosystem?

We are not a root of trust.

We believe that distributed governance is key. We aim to use transparency logs to distribute trust between national institutions, non-profits, and private individuals.

We design to integrate with the existing transparency ecosystem. We believe that logs only need to monitor that new entries added are never backdated. We aim to provide recovery paths from partial log corruption, as well as several backup plans against catastrophic log failure.
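To make the "never backdated" check concrete, here is a minimal sketch of what a monitor could verify when it sees a newer copy of a log. It assumes each entry is a `(timestamp, digest)` pair with an ISO 8601 timestamp; the names `check_update` and `parse_timestamp` are hypothetical. A real deployment would bind timestamps to signatures or a trusted timestamping service rather than trusting the value written in the entry.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-01-01T00:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def check_update(old_view, new_view):
    """Compare two views of a log and return a list of problems found.

    Entries are (timestamp, digest) tuples; this shape is an assumption.
    """
    problems = []
    # The entries we already saw must still be there, unchanged.
    if new_view[:len(old_view)] != old_view:
        problems.append("previously seen entries were modified or removed")
        return problems
    latest = parse_timestamp(old_view[-1][0]) if old_view else None
    # Every appended entry must not claim a time before its predecessor.
    for timestamp, digest in new_view[len(old_view):]:
        current = parse_timestamp(timestamp)
        if latest is not None and current < latest:
            problems.append(f"backdated entry: {digest} claims {timestamp}")
        else:
            latest = current
    return problems


old = [("2024-01-01T00:00:00Z", "aa"), ("2024-01-02T00:00:00Z", "bb")]
good = old + [("2024-01-03T00:00:00Z", "cc")]
bad = old + [("2023-06-01T00:00:00Z", "dd")]
```

Here `check_update(old, good)` reports no problems, while `check_update(old, bad)` flags the entry that claims a date earlier than entries already in the log.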

This will require transparency-enforcing tools and clients. We want to enable anyone to write their own, to create software diversity. This will require open standards and we aim to assist the wider archiving community in making fast progress on these issues.

We want to enable both the small scale and the large scale by enabling clients to interact offline with slices of the index. We want to enable anyone to operate their own archive, large or small, and build tamper-evident logs of the artifacts they care about.

What kind of present use?

The output of Archive Transparency can be described as a large plaintext index of all the item hashes and metadata that people want to keep around, with a proof of past provenance.

This can be leveraged by third parties to build content-addressable storage as well as, at the cost of generating and storing document embeddings, efficient retrieval with full-text search.

We believe that this can open a path to search through digital archives at scale without interacting with the archive itself. This can enable lower costs for digital archives, great value for users, and opportunities for others to build independent services.
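As a sketch of how a third party could build content-addressable storage on top of the hashes in the index, the example below stores each document under a path derived from its SHA-256 digest. The two-level directory layout and the function names `put` and `get` are assumptions chosen for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def put(root: Path, content: bytes) -> str:
    """Store content under a path derived from its digest; return the digest."""
    digest = hashlib.sha256(content).hexdigest()
    path = root / digest[:2] / digest[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return digest


def get(root: Path, digest: str) -> bytes:
    """Fetch content by digest and verify it still matches that digest."""
    content = (root / digest[:2] / digest[2:]).read_bytes()
    if hashlib.sha256(content).hexdigest() != digest:
        raise ValueError("stored content does not match its digest")
    return content


# Example: round-trip one document through a temporary store.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    key = put(store, b"an archived page")
    fetched = get(store, key)
```

Because the lookup key is the digest recorded in the index, anyone holding the index can check that a copy obtained from an untrusted store is the document that was originally logged.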

What about the century scale?

The problems surrounding digital archives can be described as century-scale.

We do not aim to solve everything revolving around digital archives, but we do aim to bring innovations like tamper-evident transparency logs into the long term.

Scaling file formats and protocols to several decades is not a trivial task as we cannot predict the future. This is in some fundamental way a guess at what will last, or not.

We still believe that some meaningful choices are possible, such as using text files as our preferred way of storing metadata, extensible designs, and forward compatibility.

Is this blockchain or not?

Call it a ledger if you want!

Blockchain technologies spawned many innovations making full use of modern cryptography. However, we are solving for a different kind of problem. Yes, we are using append-only immutable trees of hashes. No, we are not building a blockchain-based system.
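For readers unfamiliar with trees of hashes, here is a minimal sketch of a Merkle tree root computed in the style of Certificate Transparency (RFC 6962), one of the systems named earlier as inspiration. Whether this project uses exactly the same construction is an assumption; the sketch only shows why changing any past entry becomes evident.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    """Hash a log entry, prefixed with 0x00 to mark it as a leaf."""
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Hash two child hashes, prefixed with 0x01 to mark an interior node."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: list[bytes]) -> bytes:
    """Compute the tree root over all entries, splitting as in RFC 6962."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    # Split at the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


log = [b"entry one", b"entry two", b"entry three"]
root = merkle_root(log)

# Rewriting any past entry yields a different root.
tampered = [b"entry one", b"entry 2", b"entry three"]
tampered_root = merkle_root(tampered)
```

A single published root commits to every entry in the log, so anyone who kept an earlier root can detect a later rewrite. Unlike a blockchain, nothing here requires consensus, mining or tokens.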

We believe here that our limiting factor is not good cryptography.

We are making opinionated choices specific to long-term digital preservation and believe that transparency logs are a better fit for this. We also believe that blockchain-based notary services will have some role to play, but we cannot afford to rely on them.

Is this a work in progress?

We are still in the early stages of the project.

We are currently working on a research prototype to experiment with the design space available to us. The short term goal is to build a small-scale deployment of a million documents.

We invite you to contact us to help us scale to billions of records!

contact

interested in supporting resilient digital preservation?