Today, a story on Techmeme caught our eye. Entitled "We Need a Wikipedia for data," the article, written by ex-Googler Bret Taylor, discussed how hard it is to find open data sets on the internet. Easier access to such data could spur innovation, letting programmers build applications the likes of which have never been seen before. Beyond the concept of a data wiki itself, what made the story interesting was the insightful commentary around it, not just on the blog but all over the net, which led to the discovery of some pretty good data sources that are already available.
In his story, Bret mentioned some of the common data sources currently available, like the US Census Bureau's map data and the Reuters corpus, but his commenters came up with a few more. (See? This is why blog comments matter.)
In addition, as CNet and Ryan Stewart's blog spread the story, more people chimed in with suggestions. And of course, the Hacker News guys had some more ideas themselves.
So what did everyone come up with? As it turns out, a lot of data sources are already freely available on the net, if you just know where to look. Here's a summary. Do you have anything to add?
The CKAN site is a registry of open knowledge packages and projects. Here, you can find open knowledge resources or register one of your own. What kind of stuff can you find at CKAN? They mention a set of Shakespeare's works, a global population density database, the voting records of MPs, or 30 years of US patents as some examples, but they also point you to some useful URLs, like flickr's Creative Commons page, where photos can be searched by license type.
CKAN
This project is attempting to assemble and interconnect the world's best repository for raw data, like a giant, free, open almanac. The best way to describe it comes from MetaFilter, where the project was spotted recently: "Just as Wikipedia will help you find out something about everything, infochimps.org will help you find out everything about something." What can you find there? Every Wikipedia infobox (each infobox type in its own table), 50 years of global hourly weather data, all the tables from the US Census Statistical Abstract, oh, and 100,000 official crossword words, too.
Infochimps.org
Not a data set in the traditional sense, but definitely a useful tool, OpenStreetMap is a free, editable map of the world where you can view, edit, and use your own geographical data. The project was started because most maps actually have legal or technical restrictions on their use.
OpenStreetMap
MusicBrainz
Dismissed by the blogosphere as a bad idea, if not downright evil, Jigsaw, the marketplace that pays you to give up other people's contact info, now boasts 7 million complete contacts for the taking.
This site is a community effort to extract structured info from Wikipedia and make that data publicly available on the web, essentially turning Wikipedia into a database you can query. Is this the beginning of a semantic web? Check out their downloads section for the datasets and then scroll to the bottom for even more links to data sources on the web.
DBpedia
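To make "a database you can query" a little more concrete, here is a minimal sketch of how a query against DBpedia might be prepared. It assumes the project's public SPARQL endpoint at http://dbpedia.org/sparql and the commonly used `query` and `format` request parameters; the helper name `build_sparql_url` and the sample query are hypothetical illustrations, not something taken from the DBpedia documentation.

```python
from urllib.parse import urlencode

# Assumption: DBpedia's public SPARQL endpoint lives at this address.
DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"


def build_sparql_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL that carries a SPARQL query (hypothetical helper).

    Assumption: the endpoint accepts `query` and `format` parameters,
    as SPARQL-over-HTTP services commonly do.
    """
    params = {
        "query": query,
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


# Illustrative query: ask for the English label of the Berlin resource.
SAMPLE_QUERY = """
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Berlin>
      <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 1
"""

if __name__ == "__main__":
    # Only builds and prints the URL; fetching it is left to the reader.
    print(build_sparql_url(SAMPLE_QUERY))
```

The sketch stops at building the URL so that it runs without network access; the resulting address could then be opened in a browser or fetched with any HTTP client.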
Where DBpedia takes Wikipedia and makes it semantic, flickr wrappr extends DBpedia with RDF links to photos posted on flickr. Here's an example. Here's another. This is pure geek hotness.
Freebase, an open, shared database of the world's knowledge, received a lot of mentions in the comments, so this must be a good one. Community built and maintained, it pulls from open data sources like Wikipedia, MusicBrainz, and the SEC archives to create structured information on many topics, including more popular ones like movies, music, people, and locations. The site, unlike some of the others in this list, is also easy to navigate and well-designed, which makes it that much better to use.
Freebase
Perhaps one of the less interesting items due to its dry subject matter - financial data - it's certainly worth a mention because a free database of real-time and historical market data for trading systems and platforms is the kind of thing that really floats some people's boats.
ThingISBN is LibraryThing's first API, and even though its competitor became a paid service, ThingISBN is still free for non-commercial use. The API doesn't just return the usual book data, but also something called "edition disambiguation," meaning it also returns a list of "related" ISBNs: other editions, other media, and translations.
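As a rough illustration of what working with such a list of related ISBNs could look like, the sketch below parses a response and drops the ISBN that was asked about. The response shape (an `idlist` element with `isbn` children) is an assumption, the ISBNs are made-up placeholders, and `related_isbns` is a hypothetical helper; the request URL is left out because it isn't given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response. The element names are an assumption and
# the ISBN values are placeholders, not real books.
SAMPLE_RESPONSE = """
<idlist>
  <isbn>0000000001</isbn>
  <isbn>0000000002</isbn>
  <isbn>0000000003</isbn>
</idlist>
"""


def related_isbns(xml_text, queried_isbn):
    """Return every ISBN in the response except the one that was queried."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for element in root.iter("isbn"):
        value = (element.text or "").strip()
        # Skip empty entries and the ISBN the caller already knows about.
        if value and value != queried_isbn:
            found.append(value)
    return found


if __name__ == "__main__":
    for isbn in related_isbns(SAMPLE_RESPONSE, "0000000001"):
        print(isbn)
```

Filtering out the queried ISBN is a design choice for the example: it leaves only the other editions, media, and translations, which is usually what a caller wants to display.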
As the name suggests, Numbrary is a library for numbers. This free service helps you find, use, and share numbers from public record data sets, like census data or the CIA World Factbook.
Numbrary
This site isn't just a place to build or collect data sets, of which they have quite a nice list, but a place where you can interact with other number-lovin' folks like yourself.
theinfo.org
This blog post lists a bunch, and I mean a bunch, of open datasets on the web, which just goes to show how much of a cursory list my post really is.
Comments
On a related note, check out InfoAesthetics for data visualization: http://infosthetics.com/
Posted by: Marshall Kirkpatrick | April 9, 2008 10:34 AM
Damn Sarah.... you find the coolest stuff!!! hehehe... how do you do it!
Posted by: Matt | April 9, 2008 10:37 AM
Cogmap provides a great resource for organizational data, as well as web services for mashing up this information, which doesn't fit well into structured data sources such as freebase.
Posted by: Brent | April 9, 2008 11:00 AM
I'm assuming these have some sort of API? They're not too useful if I have to scrape the HTML every day.
Posted by: cease | April 9, 2008 11:09 AM
Great resource, Sarah! Just wanted to add a really great one you missed:
http://del.icio.us/tag/publicdata
Jon Udell came up with the idea to get the community to tag all open/public data sets with the tag "publicdata" on delicious. It has become a really nice resource for all kinds of data sets. More here:
http://blog.jonudell.net/2007/07/17/revisiting-language-evolution-in-delicious/
and for what it is worth, a few of our posts along these lines here:
http://www.kirix.com/blog/category/data-search/
Posted by: Ken Kaczmarek | April 9, 2008 11:23 AM
Wikinvest.com just launched a cool product (WikiData)...it gives industry specific financials for companies. For instance, Same Store Sales is only important for retailers, where revenue per passenger is only important for airlines. They have a user generated DB of this sort of industry specific data....really good stuff.
-Wayne
Posted by: Wayne Mulligan | April 9, 2008 11:38 AM
How about Swivel (www.swivel.com)?
Posted by: Perry Mizota | April 9, 2008 11:42 AM
OpenCyc?
http://en.wikipedia.org/wiki/Cyc
http://www.opencyc.org/
Posted by: Phillip Rhodes | April 9, 2008 1:01 PM
Jeez Louise, this is a whole subspecialty. Just ask Kingsley Idehen! His software runs DBpedia.
There are any number of pay-to-play web services for Geo and open retail pricing data. Most importantly there is UDEF.com, where the experts from the aerospace industry are wrangling open data vocabularies and standards using the whole alphabet soup of XML, RDF, WS, etc. You would not know it from looking at the web site (UGLY), but these folks are monster experts.
Well, the UDEF man is from Lockheed, I think, and he was a nice man, the most interesting person at the XML Open seminar at Oracle last year - well, I am more interesting, but not for the utility of my knowledge, rather for my quickness of wit.
Yes, yes, I found him: Ron Schuldt. You should talk to him for a future article.
Hey! Anyone out there got any work for smart Jewish boy? Contract product manager, vertical markets, writing, speaking, PR, community relations? Industry relations? Wax yo car?
Posted by: Alan Wilensky | April 9, 2008 1:06 PM
I started http://infochimps.org to try to build part of this "Almanac" (ALLmanac?) to sit next to wikipedia's "Encyclopedia". It's a huge project, one that will require massive community involvement and cooperation among the projects cited above.
I think the main virtues of http://infochimps.org lie in its suckiness:
- we're "messy". We're looking to loosely couple data: make it discoverable, make it publicly curated, make it interconnect -- but not to impose any kind of strict structure or format or ontology. You can sit happily in our DB with nothing more than a title, a list of credits, and a few tags.
- we're "stupid" -- no live access. As much as possible, we'd like to give infochimps data to work with on their machine, using their tools (and not incidentally their CPU cycles).
- we're "not good at any one specific thing". There are sites with economic data, with UN data, with astronomical data, with baseball statistics, with social network graphs. We need a place that allows and in fact inspires connections among all these rich sources of data, and gives you immediate access to them.
We're about to enter the age of ubiquitous information. Drawing these data stores into open formats, making them discoverable, and interconnecting them across knowledge domains presents explosive opportunities. But who will own this data and what access will they allow? Infochimps wants to ensure that the answer is 'everyone' and 'all of it', but these projects will only succeed with community involvement. If you suspect you may also be an infochimp, please get in touch.
Posted by: mrflip.myopenid.com | April 9, 2008 2:02 PM
WHOIS data is not easy to get either. Here is an attempt at it and the response from ARIN.
http://ideaindustries.net/archive/2007/04/06/getting-access-to-public-whois-data-a-frustrating-update.aspx
Posted by: romno | April 9, 2008 2:21 PM
I think you missed http://data.un.org which is a great resource for information regarding developing countries.
Posted by: PIerre N | April 9, 2008 2:26 PM
Great post, thanks for including Datawrangling... I keep my list of datasets updated here: http://del.icio.us/pskomoroch/dataset
There have been a number of new data links added to my bookmarks since I did my last post, so it might be worth digging in.
Organizing and standardizing these datasets can be a difficult problem and I'm encouraged by the progress from sites like freebase, dbpedia, and numbrary... In the meantime, there is a lot of value in just providing raw data dumps in a few basic formats (like infochimps is doing) so people can get things done.
For many projects, I would often prefer to get direct access to compressed flat text, YAML, or XML files instead of repeatedly calling a web api to build up a local copy of structured data, or dealing with latency of fetching it in realtime.
Posted by: Pete Skomoroch | April 9, 2008 3:19 PM
SmartHippo.com has mortgage rate data available under a Creative Commons license.
Posted by: Jeff Cozington | April 9, 2008 3:44 PM
You forgot to include the wiki http://www.numberzoom.com/ for looking up user-contributed phone numbers.
Posted by: Jillian | April 9, 2008 4:25 PM
Great article! UN Data ( http://data.un.org/ ) is another massive data source. Tools like Yahoo Pipes and Feedity.com also make it easier to build feeds and mashups from raw data, and reuse them for more objective purposes.
Posted by: Ashutosh | April 9, 2008 5:17 PM
Just wanted to also plug the fact that Zillow neighborhood boundaries (for the US) are available under a Creative Commons license -- http://www.zillow.com/labs/NeighborhoodBoundaries.htm
Posted by: Drew Meyers | April 9, 2008 5:37 PM
This is a great list.
I also compiled a similar list of data sets lists a month or so ago but you found several that I missed.
Posted by: Mark Reid | April 9, 2008 9:05 PM
Hi, you forgot to include the Data360 project.
Description from home page:
Posted by: fionda | April 9, 2008 9:11 PM
Check out DataPlace - they've done a lot of work to harmonize work so it's actually comparable. Don't think they have APIs but they have a wonderful system for anyone who wants to look at their community.
http://www.dataplace.org/
Posted by: Darlene Fichter | April 9, 2008 9:33 PM
Great list!
See also the Linking Open Data project, which is exposing these kinds of datasets on the (data) Web as linked data (and many also have SPARQL APIs). Check the diagram.
Posted by: Danny | April 10, 2008 3:14 AM
Seeking Alpha has made conference call transcripts freely available for the first time. They're now searchable:
http://seekingalpha.com/article/69866-googles-killer-app-for-investors-consultants-and-journalists
Posted by: David | April 10, 2008 4:38 AM
I think Bret's idea (and the Freebase idea, and some other wiki-data aggregation ideas that have come about) is good, but a bit misguided.
I think that it's more likely that Free Data will be developed on multiple platforms in different vertical spaces. TVIV is a great collection of TV data; Open Guides has awesome business listings, and Chefmoz has great restaurant data and reviews. I'm part of a great project called Vinismo to document every wine and winery in the world with structured and unstructured data.
I think the continued growth in Free Data will be helped by making these different efforts license-compatible (to allow re-mixing and sharing) and data-compatible. I think much-maligned efforts like the Semantic Web and RDF make it easier to combine and re-use data from different fields of endeavour.
CKAN is a great project to get these different groups talking to one another (literally and figuratively).
Posted by: evan.prodromou.name | April 10, 2008 6:16 AM
Roll your own:
http://mcdc2.missouri.edu/websas/geocorr2k.html
Posted by: Dick C. Flatline | April 10, 2008 6:30 AM
Can someone point me to the 50 years of global hourly weather data on infochimps.org? I cannot find it, even though Infochimps mentions it on its site as well.
Posted by: Weather Looker | April 10, 2008 6:32 AM
On a related note, check out SWSE, the Semantic Web Search Engine, which provides access to a multi-million corpus collected from the Web via a SPARQL endpoint.
Posted by: Andreas Harth | April 10, 2008 10:13 AM
The weather data won't be live at http://infochimps.org until the next site update. If you email me (flip at infochimps.org), I will send you a username / password to get the raw data.
Posted by: mrflip.myopenid.com | April 10, 2008 2:53 PM
the opendatamovement is aiming on this topic!
Posted by: id.berlotti.com | April 11, 2008 4:40 AM
http://www.swivel.com/
Posted by: Brent | April 11, 2008 8:14 PM
See also http://www.geonames.org/ for geolocalization data :-)
Posted by: Antonio Bonanno | April 13, 2008 3:53 PM
It's rather a focused effort for chemists but ChemSpider provides access to almost 20 million compounds and related data. For example, for butanol (http://www.chemspider.com/q/butanol) you can see the structure, links to multiple other sources of information, property data and so on. This effort links together 100 data sources into one free access site.
Posted by: ChemSpiderMan | April 13, 2008 4:26 PM
Currently I am working on StYLiD (Structure Your own Linked Data). Like Freebase or Google Base, it allows users to define their own structured concepts and share any type of data based on them. It is a social platform like digg.com or revyu.com, but it lets users share a wide variety of data - anything they can conceptualize. Moreover, it consolidates multiple definitions for the same concept and forms virtual concepts incrementally. The project is in its initial stage now. It is aimed at structured information sharing on the web rather than creating another Wikipedia database or world's knowledge. However, if it gets popular and people start sharing various types of information every day, it will turn out to be a useful collection of open data.
Posted by: Aman | April 13, 2008 9:01 PM
open ontologies and semantic thesauri are an organic next step to deepen and qualify the knowledge hidden in wikipedia. some categorization efforts are already moving forward (if they wouldn't get sidetracked with the wikia social network and search engine). the approach of creating a large silo for collecting an abundance of free data to do all kinds of datamining (and marketing research) tricks goes into another direction.
http://www.scribd.com/doc/9582/integrating-wikipediawordnet
http://www.opencyc.org/
http://wordnet.princeton.edu/
Posted by: pit schultz | April 14, 2008 7:58 PM
SERVIR is a regional visualization and monitoring system for Mesoamerica that integrates satellite and other geospatial data for improved scientific knowledge and decision making by managers, researchers, students, and the general public. The bilingual SERVIR website provides open access to satellite and other geospatial datasets
http://www.servir.net
Posted by: ktl | April 15, 2008 9:20 AM
I found a nice dataset for readers interested in weather.
The Dutch meteorological institute (KNMI) has digitized 1.6 million hourly meteorological observations from the Amsterdam City Water Office in the Netherlands for the period 1784 - 1963.
Measurements for temperature, pressure, wind direction and wind speed are freely available for download in 4 zipped text files. You can also take a look at the handwritten log pages for all measurements. There is also a report with more information about the dataset in English.
The files:
http://www.knmi.nl/klimatologie/stadswaterkantoor/uitleg.html
The handwritten logpages:
http://www.knmi.nl/klimatologie/stadswaterkantoor/index.cgi
Enjoy!
Posted by: Borb | April 16, 2008 3:28 PM
Advantage Base Knowledge http://advantagebk.com/ has lists of telemarketing reverse lookup numbers, although it is not very user friendly.
Posted by: headly | April 19, 2008 2:28 AM
STUMBLED!
Nice list, haven't heard of most of these.
VOTED for you at:
http://www.newsdots.com/industrynews/where-to-find-open-data-on-the-web/
Posted by: Geoserv | April 19, 2008 7:01 PM