List of high-quality open datasets in public domains

[+] dvcrn|10 years ago|reply

I don't quite understand these awesome lists. From what I've seen it usually ends up being a way for creators to promote their stuff and for the list creator to have a big project with a few thousand stars in their profile. So when I did something in say, electron, I would go to the awesome-electron list and add it there for promotion sake.

I couldn't find a usecase for these lists myself yet. There is no way to verify the quality of the product or the activity (stars for example? last commit date?).

In one case I searched for aws adapters for a language, clicked on all links inside awesome-{{language}} just to find that all of them are inactive or a few days young. I ended up using something I found on google instead.

[+] elcapitan|10 years ago|reply

When I started learning Golang last year I found the awesome-go list quite helpful. Not because I needed every single library mentioned, but because it quickly gave me an impression of how the ecosystem plays out and what typical ways to build stuff are. Because that's quite important in case you're not starting from an existing framework (.net or Rails etc).

[+] detaro|10 years ago|reply

It really depends on how well they are maintained and how selective they are. (E.g. a long list with a few links for many categories is often more useful than one that lists everything that possibly fits in one category). And some basic information for each entry is really needed.

GitHub offers a reasonable way to manage contributions to them, compared to many other solutions. It is easy to suggest/fix something for external contributors, but the owner can work as a gatekeeper. This is something many link aggregators or bookmarking sites lack.

One example that isn't perfect, but I found interesting was https://github.com/Kickball/awesome-selfhosted It has a sentence about each project, license information, and tries to purge unmaintained projects.

[+] 7952|10 years ago|reply

I think the benefit is as a source of inspiration. Finding data can require a lot of domain knowledge which students especially tend to lack. The problem is that the lists are always incomplete and other data may be more appropriate for a particular use case.

Also, quality is more complex than just stars or activity. You need to know what the advantages or limitations are and how that applies to your project.

[+] Kluny|10 years ago|reply

I'm quite keen to start learning about data science. A bunch of big datasets like this is exactly what I need.

[+] minimaxir|10 years ago|reply

69 points, #3 on Hacker News, and no comments? :P

This list would be much improved with descriptions for each dataset and indication of schema, as some of the datasets listed have very unfriendly schema. (e.g. the IMDB interfaces link)

Kaggle's recently-released Public Datasets feature (https://www.kaggle.com/datasets) provides an interesting approach to presenting data and qualifying datasets by giving good examples of data robustness.

[+] qume|10 years ago|reply

Exactly my thought as soon as I looked at the lack of comments.

How is it this community can debate some inconsequential nonsense, and there is no discussion here of how we get a consistent set of meta-data for these data sources.

There are researchers both in academia and in the commercial world who would thrive if there were such a list with good consistent meta-data on how to interact with it.

Disclosure: I work regularly with open datasets, and the effort it takes to work with each different set overshadows any effort on actual analysis.

[+] mistermann|10 years ago|reply

> 69 points, #3 on Hacker News, and no comments?

I expect this is because there is no ability on HN to differentiate (on articles and comments) between bookmark and upvote, most likely the majority of votes are for the purposes of bookmarking. Very often I want to upvote someone for a good comment, but I do that very sparingly now because I try to keep my upvoted comments list minimal so when I try to find something noteworthy I don't have to wade through pages of "liked" comments.

[+] davecap1|10 years ago|reply

SolveBio (my startup) has parsed, normalized, and indexed a bunch of the datasets listed under biology. Our goal is to make these kinds of datasets easier to access for programmers and non-programmers alike, similar to other some sites mentioned here (Enigma and Quandl) but for genomics. You can query and filter the data on the website or through one of our API clients: https://www.solvebio.com/library

[+] lap88|10 years ago|reply

Sounds like a good idea, especially the normalization part, but your site requires Java Script without even some basic functionality without it... nope.

[+] discardorama|10 years ago|reply

The author's notion of a "dataset" is weird. Under "Finance", there's a link to Google Finance page ( http://finance.google.com/ ). How is that a "dataset" ??

[+] chestnut-tree|10 years ago|reply

For those in the UK, the available Government datasets are published on http://www.data.gov.uk

The datasets are not public domain, but licensed under the Open Government Licence (which allows you to use and adapt the data for commercial use).

There's also the Global Open Data Index: a website that ranks countries by how much Government data is available as open datasets based on certain criteria. The current top spot is taken by Taiwan

  1. Taiwan
  2. UK
  3. Denmark
  4. Colombia
  5. Finland
  5. Australia
  7. Uruguay
  8. USA
  8. Netherlands
  10. Norway
  10. France

http://index.okfn.org/place/

[+] psykovsky|10 years ago|reply

You mean Colombia?

[+] clockwerx|10 years ago|reply

I wish linkeddata.org or ckan installs weren't being reinvented here, but instead ckan supported pull requests or similar decentralized ways to publish new data sets

[+] rossj|10 years ago|reply

If you have any ideas/suggestions for how this might be implemented in CKAN, please do drop a mail to the list ( https://lists.okfn.org/mailman/listinfo/ckan-dev ) or add an issue at https://github.com/ckan/ideas-and-roadmap/issues for discussion.

[+] yzh|10 years ago|reply

For the complex network part, I think the collection missed this one: http://www.networkrepository.com/ The site itself is a collection of several publicly available network datasets.

[+] jack9|10 years ago|reply

I noticed no http://commoncrawl.org/ (oh no, naked domain!) or http://www.cochrane.org/

I don't quite understand the criteria for being included in the list since I think it's:

https://groups.google.com/forum/#!forum/awesomepublicdataset...

[+] patrickk|10 years ago|reply

Betfair Historical Exchange Data requires you to have "100 Betfair points" which you acquire by gambling on their site. It's hardly an open dataset.

[+] Spooky23|10 years ago|reply

Check out data.ny.gov

Also nycopendata.socrata.com

[+] lifeisstillgood|10 years ago|reply

Is it too late to create a central registry of datasets - to aid discoverability. A voluntary system maintained by convention?

Perhaps a distributed registration system ala DNS?

[+] Symbiote|10 years ago|reply

We have https://www.biodiversitycatalogue.org/ for biodiversity informatics APIs. A hackathon I attended made an API for registering, but I don't think it was deployed.

[+] mikedillion|10 years ago|reply

Sure, go ahead!

[+] tylercubell|10 years ago|reply

Enigma.io is great for public data too.

[+] minimaxir|10 years ago|reply

I took another look at the Enigma.io public datasets. Over 50% of all the public datasets are from the Federal Reserve Bank of St. Louis. Finance data is boring. :P

Quandl (https://www.quandl.com/browse) is similar to Engima, except they got rid of all the fun datasets and added more finance/economic datasets. Hmrph.

[+] legulere|10 years ago|reply

It's strange that they put Wikidata under natural language.

34 comments