Transparency Talk


Beyond Alphabet Soup: 5 Guidelines For Data Sharing
August 29, 2013

(Andy Isaacson is a Forward Deployed Engineer at Palantir Technologies. This post is re-posted from the Markets for Good blog. Please see the accompanying reference document, Open Data Done Right: Five Guidelines, which is available for download and open for your own thoughts and comments.)

The Batcomputer was ingenious. In the 1960s Batman television series, the machine took any input, digested it instantly, and automagically spat out a profound insight or prescient answer – always in the nick of time (watch what happens when Batman feeds it alphabet soup). Sadly, of course, it was fictional. So why do we still cling to the notion that we can feed in just any kind of data and expect revelatory output? As the saying goes, garbage in yields garbage out; so, if we want quality results, we need to begin with high quality input. Open Data initiatives promise just such a rich foundation.

Presented with a thorny problem, any single data source is a great start – it gives you one facet of the challenge ahead. However, to paint a rich analytical picture with data, to solve a truly testing problem, you need as many other facets as you can muster. You can often get these by taking openly available data sets and integrating them with your original source. This is why the Open Data movement is so exciting. It fills in the blanks that lead us to critical insights: informing disaster relief efforts with up-to-the-minute weather data, augmenting agricultural surveys with soil sample data, or predicting the best locations for Internally Displaced Persons camps using rainfall data.
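
As a minimal sketch of what that kind of integration can look like in practice (assuming pandas is available; the file and column names here are hypothetical, invented for illustration):

import pandas as pd

# Your original source: field reports from relief teams.
reports = pd.read_csv("field_reports.csv")

# An openly published dataset: daily rainfall by district.
rainfall = pd.read_csv("open_rainfall.csv")

# Join the two so each report carries the rainfall recorded for its district that day.
enriched = reports.merge(rainfall, on=["date", "district"], how="left")
print(enriched.head())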

High quality, freely available data means hackers everywhere, from Haiti to Hurricane Sandy, are now building the kinds of analytical tools we need to solve the world’s hardest problems. But great tools and widely released data aren’t the end of the story.

At Palantir, we believe that with great data comes great responsibility, both to make the information usable, and also to protect the privacy and civil liberties of the people involved. Too often, we are confronted with data that’s been released in a haphazard way, making it nearly impossible to work with. Thankfully, I’ve got one of the best engineering teams in the world backing me up – there’s almost nothing we can’t handle. But Palantir engineers are data integration and analysis pros – and Open Data isn’t about catering to us.

It is, or should be, about the democratization of data, allowing anybody on the web to extract, synthesize, and build from raw materials – and effect change. In a recent talk at a G-8 Summit on Open Data for Agriculture, I outlined the ways we can help make this happen:

#1 – Release structured raw data others can use

#2 – Make your data machine-readable

#3 – Make your data human-readable

#4 – Use an open-data format

#5 – Release responsibly and plan ahead

Abbreviated explanations below. Download the full version here: Open Data Done Right: Five Guidelines.

#1 – Release structured raw data others can use

One of the most productive side effects of data collection is being able to re-purpose a set collected for one goal and use it towards a new end. This solution-focused effort is at the heart of Open Data. One person solves one problem; someone else takes the exact same dataset and re-aggregates, re-correlates, and remixes it into novel and more powerful work. When data is captured thoroughly and published well, it can be used and re-used in the future too; it will have staying power.

Release data in a raw, structured way – think a table of individual values rather than words – to enable its best use, and re-use.
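
To make the contrast concrete, here is a minimal illustration in Python using the standard csv module; the soil-sample fields and values are hypothetical:

import csv

# One row per observation, one column per raw value - not a paragraph of prose.
rows = [
    {"sample_id": "A-001", "lat": 18.539, "lon": -72.335, "ph": 6.4},
    {"sample_id": "A-002", "lat": 18.541, "lon": -72.340, "ph": 5.9},
]

with open("soil_samples.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sample_id", "lat", "lon", "ph"])
    writer.writeheader()
    writer.writerows(rows)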

#2 – Make your data machine-readable

Once structured raw data is integrated into an analysis tool (like one of the Palantir platforms), a machine needs to know how to pick apart its individual pieces.

Even if the data is structured and machine readable, building tools to extract the relevant bits takes time, so another aspect of this rule is that a dataset’s structure should be consistent from one release to the next. Unless there’s a really good reason to change it, next month’s data should be in the exact same format as this month’s, so that the same extraction tools can be used again and again.

Use machine-readable, structured formats like CSV, XML, or JSON to allow the computer to easily parse the structure of data, now and in the future.
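
A small sketch of what "the same extraction tool, month after month" can look like, continuing the hypothetical soil-sample schema from above:

import csv

# The structure every release is expected to follow (hypothetical fields).
EXPECTED_COLUMNS = ["sample_id", "lat", "lon", "ph"]

def load_release(path):
    """Parse one monthly release; fail loudly if the structure has changed."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError("schema changed: %s" % reader.fieldnames)
        return [
            {
                "sample_id": row["sample_id"],
                "lat": float(row["lat"]),
                "lon": float(row["lon"]),
                "ph": float(row["ph"]),
            }
            for row in reader
        ]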

#3 – Make your data human-readable

Now that the data can be fed into an analysis tool, it is vital for humans, as well as machines, to understand what it actually means. This is where PDFs come in handy. They are an awful format for a data release, since they can baffle automatic extraction programs, but as documentation they can explain the data clearly to the people using it.

Assume nothing – document and explain your data as if the reader has no context.
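
One simple way to do this is to ship a plain-text data dictionary alongside each release. A minimal sketch in Python, with hypothetical file and field names:

# A short data dictionary that gives a reader with no context enough to start.
DATA_DICTIONARY = """\
soil_samples_2013.csv - one row per field sample
  sample_id   unique identifier assigned by the survey team
  lat, lon    sample location in WGS84 decimal degrees
  ph          soil pH measured on site (unitless, 0-14 scale)
  collected   collection date, ISO 8601 (YYYY-MM-DD)
"""

with open("soil_samples_2013_README.txt", "w") as f:
    f.write(DATA_DICTIONARY)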

#4 – Use an open-data format

Proprietary data formats are fine for internal use, but don’t force them on the world. Prefer CSV files to Excel, KMLs to SHPs, and XML or JSON to database dumps. It might sound overly simplistic, but you never know what programming ecosystem your data consumers will favor, so plainness and openness are key.

Choose to make data as simple and available as possible: when releasing it to the world, use an open data format.
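
Converting from a proprietary format to an open one is often a one-liner. A sketch, assuming pandas plus an Excel reader such as openpyxl is installed, and with hypothetical file names:

import pandas as pd

# Read the internal, proprietary spreadsheet...
df = pd.read_excel("soil_samples.xlsx")

# ...and publish open equivalents instead.
df.to_csv("soil_samples.csv", index=False)
df.to_json("soil_samples.json", orient="records", indent=2)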

#5 – Release responsibly and plan ahead

Now that the data is structured, documented, and open, it needs to be released to the world. Simply posting files on a website is a good start, but we can do better, for example by offering a REST API.

Measures that protect privacy and civil liberties are hugely important in any release of data. Beyond simply keeping things up-to-date, programmatic API access to your data allows you to go to the next level of data responsibility. By knowing who is requesting the data, you can implement audit logging and access controls, understanding what was accessed, when, and by whom, and limiting exposure of any possibly sensitive information to just the select few who need to see it.

Allow API access to data, to responsibly provide consumers the latest information – perpetually.
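
For the sake of illustration, here is a minimal sketch of such an API using Flask (my choice here, not something prescribed by these guidelines); the endpoint, key list, and file names are hypothetical placeholders:

import csv
import logging

from flask import Flask, abort, jsonify, request

app = Flask(__name__)
logging.basicConfig(filename="access.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

# Access control: only known consumers get the data (keys are placeholders).
API_KEYS = {"example-key-123": "relief-partner"}

@app.route("/v1/soil-samples")
def soil_samples():
    caller = API_KEYS.get(request.headers.get("X-API-Key", ""))
    if caller is None:
        abort(401)
    # Audit logging: record who accessed what, and when.
    logging.info("soil-samples accessed by %s", caller)
    with open("soil_samples.csv", newline="") as f:
        return jsonify(list(csv.DictReader(f)))

if __name__ == "__main__":
    app.run()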

...

These guidelines seem simple, almost too simple. You might wonder why in this high tech world we need to keep things so basic when we have an abundance of technological solutions to overcome data complexity.

Sure, it’s all theoretically possible. However, in practice, anybody working with these technologies knows that they can be brittle, inaccurate, and labor intensive. Batman’s engineers can pull off extracting data from pasta, but for the rest of us, relying on heroic efforts means a massive, unnecessary time commitment – time taken away from achieving the fundamental goal: rapid, actionable insight to solve the problem.

There’s no magic wand here, but there are some simple steps to make sure we can share data easily, safely and effectively. As a community of data consumers and providers, together we can make the decisions that will make Open Data work.

-- Andy Isaacson
