Weekly Update: Rufus Pollock

OCTOBER 17, 2011

Availability

Activity

  • In San Francisco last week, Tuesday to Saturday. Attended the Code for America Summit and had a variety of useful meetings, including with Wikimedia Foundation folks Erik Möller and Dario Taraborelli, Dan Whaley of Hypothes.is, and Max Ogden.
  • Below are some written-up notes from an excellent chat with Max.

This Week

Max Ogden Chat

1. Changes protocol

Want a general changes protocol for data (can we generalize CouchDB’s _changes to lots of other stuff, including the Webstore?)

  • A changes protocol => syncing between DBs => support for updates and distributed systems (a rough sketch of the idea is below)
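To make this a bit more concrete, here is a minimal Python sketch of what a generic changes feed could look like, in the spirit of CouchDB’s _changes; all the names here (ChangeLog, changes_since, etc.) are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Change:
    seq: int           # monotonically increasing sequence number
    key: str           # identifier of the changed record (row id, doc id, ...)
    value: Any = None  # new version of the record (None if deleted)
    deleted: bool = False

class ChangeLog:
    """Append-only log that any data store could expose to support syncing."""

    def __init__(self):
        self._entries: List[Change] = []
        self._seq = 0

    def record(self, key, value=None, deleted=False):
        """Called by the store whenever a record is created, updated or deleted."""
        self._seq += 1
        change = Change(seq=self._seq, key=key, value=value, deleted=deleted)
        self._entries.append(change)
        return change

    def changes_since(self, since=0):
        """The heart of the protocol: everything that happened after `since`."""
        return [c for c in self._entries if c.seq > since]

# A replica syncs by remembering the last sequence number it has seen:
log = ChangeLog()
log.record("row-1", {"name": "alice"})
log.record("row-2", {"name": "bob"})

last_seen = 0
for change in log.changes_since(last_seen):
    # apply the change to the local copy here
    last_seen = change.seq
```

The point is that any store able to answer “what changed since sequence N?” gets syncing and replication more or less for free, whatever its internal model.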

2. Diffing and merging

  • We lack a diff (patch) format and a merge protocol (see previous posts We Need Distributed Revision Control for Data and (older) Collaborative Development of Data)
  • Note these issues are related but not identical – full-on merging, as in e.g. git or mercurial, is about more than simple patch application.
  • Diff format options (are there more?)
    1. Brute force: e.g. serialize to text and use git
    2. Identify an atomic structure (e.g. the document) and apply the diff at that level (think CouchDB, or standard copy-on-write for an RDBMS at row level); see the sketch after this list
    3. Recording transforms (e.g. Refine)
  • Capturing diffs at the document level in a given system, e.g. CouchDB (trivial, as one can just provide the new and old documents) or an SQL database (approximately the Write-Ahead Log), isn’t that hard, though it is often not immediately exposed (as with SQL) and, more importantly, it is specific to the data store and, often, to the type of data
  • Aside: if using something like CouchDB, how would we capture edits made ‘offline’, e.g. on a CSV version locally?
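Here is a minimal Python sketch of diff option 2 above, treating the row (keyed by an id) as the atomic unit for tabular data; the patch format shown is purely illustrative, not a proposed standard.

```python
def diff_tables(old, new):
    """old/new map row id -> row (a dict of field values); return a patch."""
    patch = []
    for row_id, row in new.items():
        if row_id not in old:
            patch.append({"op": "add", "id": row_id, "row": row})
        elif old[row_id] != row:
            patch.append({"op": "update", "id": row_id, "row": row})
    for row_id in old:
        if row_id not in new:
            patch.append({"op": "delete", "id": row_id})
    return patch

def apply_patch(table, patch):
    """Plain patch application; real merging (handling conflicts) is the hard part."""
    result = dict(table)
    for op in patch:
        if op["op"] == "delete":
            result.pop(op["id"], None)
        else:
            result[op["id"]] = op["row"]
    return result

old = {"1": {"name": "alice", "city": "SF"}, "2": {"name": "bob", "city": "NYC"}}
new = {"1": {"name": "alice", "city": "London"}, "3": {"name": "carol", "city": "SF"}}

patch = diff_tables(old, new)
assert apply_patch(old, patch) == new
```

Applying a patch like this is easy; the hard part is merging, i.e. deciding what to do when two patches touch the same row, which is exactly the git/mercurial-style problem noted above.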

3. Micro Schemas

Micro-schemas = just enough of a standardized schema to tie apps, tools and data together in a minimal way

  • Micro-schemas = small schemas (“conventions”) for data that would support interoperability and allow for generic tools and apps.
  • Simple example: a geodata micro-schema for tabular data that required every dataset with long/lat points to have a field named geometry containing GeoJSON representing that point. That way one could build a data browser to present any such dataset (see the sketch after this list).
  • Not only should these schemas be very simple, they should also be combinable, so that a given dataset could support many of them.
  • Aside: I’d sort of prefer the term knowledge or information API here, as it would suggest all the nice analogies with code: schemas are just like APIs, but for information: they provide a standard way for other systems to interact with that material. Plus, things like the combinability I just referred to in the previous point have a nice analogy with MixIns or abstract interfaces in code, where a given class can implement many MixIns or abstract interfaces. However, I fear the term API may just be too technical and “geeky”.
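A minimal Python sketch of the geodata micro-schema example above; the field name geometry and the check itself are illustrative assumptions about how such a convention might be tested, not an agreed specification.

```python
def supports_geo_microschema(rows):
    """True if every row carries a GeoJSON Point in a `geometry` field."""
    for row in rows:
        geom = row.get("geometry")
        if not isinstance(geom, dict) or geom.get("type") != "Point":
            return False
        coords = geom.get("coordinates")
        if not (isinstance(coords, (list, tuple)) and len(coords) == 2):
            return False
    return True

rows = [
    {"name": "example place", "geometry": {"type": "Point", "coordinates": [0.12, 52.20]}},
]
print(supports_geo_microschema(rows))  # True, so a map view can be offered
```

Any tool that runs a check like this could then offer a map view for conforming datasets without knowing anything else about them, which is the whole point of keeping the convention small.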