Recently, I’ve been channeling a bit of a research vibe. I’ve been looking at thermal comfort levels, and at how to determine our ability to stay warm within our respective environments. Part of this research involves dealing with satellite data. As I started trawling the internet for APIs and searching for who has the data or would vend it to me, I stumbled across quite a few NASA .gov sites that had various pieces of the data I wanted. The craziest thing was that I didn’t think to look there first. So before I write more about government open data, I’d like to use this post to generate a high-level list of the things that I believe would be required to make govt. data more useful.
### Discoverability

Discoverability in this sense would be a very simple, unified platform that makes govt. data searchable on a Google / Wolfram / word2vec level. We need a way to simply list the frequency and type of data available.

### Understandability

Parsing govt. documents is close to impossible without spending at least a week reading about some old proprietary format. There need to be simple example programs demonstrating how the data might be used to calculate something useful: a regression, the average number of B-grade restaurants in New York, or the average cloud coverage in Spain.

### GitHub

The government needs its own GitHub repository for people who don’t work at or for the govt. Currently, if you write a JSON parser for an obscure govt. format, you have nowhere to post it that would link back to the originating source of the data. This would go a long way towards the suggestions above.

### Data Cluster

Currently JPL and some govt. facilities, oddly enough, will allow you to ssh to a machine and take over a supercomputer for a period of time. Why they’d do this makes very little sense to me, but they do. An alternative, in my mind, is to simply put all the govt. data in a large govt.-sponsored Hadoop cluster that has some restrictions on when jobs will be run. While at it, add a pub-sub system that allows users to simply subscribe to govt. data as it comes in. Basically, this would be a public computation cluster that streams out data from various govt. sources. The system should help replace or augment the antiquated FTP services that currently exist. The net benefit is that the platform would be govt.-sponsored and available to all govt. agencies, which could simply throw their data into Hadoop in their proprietary format.
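To make the pub-sub idea a little more concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the subscribe/publish pattern, not a design for the real thing: the `DataFeed` class, the topic names, and the record fields are all hypothetical, and the values are made up. A real system would sit on a proper message broker rather than a dictionary of callbacks.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# A handler is any callable that accepts one data record (a plain dict here).
Handler = Callable[[dict], None]


class DataFeed:
    """Toy in-memory pub-sub broker (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # Maps a topic name to the list of handlers subscribed to it.
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every record on a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, record: dict) -> int:
        """Send a record to all subscribers of a topic; return how many got it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(record)
        return len(handlers)


feed = DataFeed()
received: List[dict] = []

# A user subscribes to a (made-up) cloud coverage topic.
feed.subscribe("cloud-cover/spain", received.append)

# An agency publishes a record as it comes in (values are placeholders).
delivered = feed.publish("cloud-cover/spain", {"date": "2000-01-01", "cover_pct": 42.0})

# Nobody subscribed to this topic, so the record goes nowhere.
ignored = feed.publish("restaurant-grades/nyc", {"grade": "B"})
```

The point of the pattern is that the agency publishing the data never needs to know who is consuming it, which is what would let it replace polling an FTP directory.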
Government data is hard to use and not very accessible, and open data doesn’t really go far enough. There are a million little blockers that prevent people from using the data. Why not make it really easy for both parties? Offload fancy formatting to taxpayers, and taxpayers get to use computational resources that they already pay for in a more productive way. Finally, we might be able to see what govt. data is useful given the popularity of various open source projects that are directly tied to publicly available govt. data.