There is quite a bit of interest on campus and elsewhere in data libraries, data sharing, open data, DSIC … and that is just the tip of the iceberg. This topic is relevant to both research and government, and through those to probably a great deal of human endeavor. Many of the conversations on this topic that I’ve observed or overheard have related to professional challenges in managing the process, facilitating access, promoting awareness, soliciting data, verifying and evaluating quality, and similar issues. Then I stumble across this presentation saying how to do it simply, easily, with virtually no money or resources, to help the people in your own local community. Provocative to think about in the context of the larger initiative!

Slideshare: DIY data store for your town (William Perrin):

  1. Alas, that seems to need sound or notes attached…

    But currently there’s a *HUGE* issue with data curation. The NSF demands a data management plan in new grants. People are clueless about both what is meant and how to accomplish what they believe might be meant.

    The good side is that there is some funding recognition outside NASA and the NIH about data management. The bad side is that this was dropped into requirements without the people evaluating grants quite grasping the consequences. (And no, I don’t know anyone talking to librarians. Even when some people mention the connection…)

    NASA’s had this problem for at least a decade and a half. They can’t even transfer all their data from older storage media to newer media within the older media’s lifetime. NIH has somewhat recognized the problem, but mostly on collection and less on curation.


    • Absolutely! Data curation is part of what I was thinking of along with “data libraries” – all the aspects of what professional librarians and libraries do with anything they collect (select, protect, promote, preserve, etcetera). Not intending to imply that those critical conversations don’t need to happen and that these core issues don’t need to be addressed, just fascinated that folk are moving ahead at the grass roots level at the same time that others are tackling the issue from the other end.

  2. Yup! Some examples with which I’m familiar are the UF sparse matrix collection, the UCI machine learning data sets, and various graph databases. Each is somewhat carefully curated, and *none* can keep up with the scale of modern computational problems. These repositories are maintained by necessity and not by direct funding. The “practical” problems all are at least one order of magnitude larger than anything in curated data sets.

    And that’s not even addressing the relatively straightforward data sets like satellite imagery, climate data, etc.

    The grass roots exist in my areas (sparse matrices and graphs), but they’re almost detrimental. The somewhat unfunded collections are used in funded research, but they don’t adequately represent the forefront of desired capabilities. It’s not the fault of any of the curators. Extracting information from the entities (gov’t agencies, companies, etc.) that want to push the state of the art is frightfully difficult.

    • Very, very real and practical problems, Jason. I don’t have a clue what the solution will be, but somehow it will have to work out. Someone will come up with some creative and brilliant test to account for variation in different data sets, or will come up with a way to get researchers to deliver data in a standardized format. Or something else that is beyond me. I am glad that you are discussing the problem and approaches, and hope others partner with you on finding solutions.

