item 11002423

List of high-quality open datasets in public domains

388 points | Jasamba | 10 years ago | github.com

34 comments

[+] dvcrn | 10 years ago
I don't quite understand these awesome lists. From what I've seen, they usually end up being a way for creators to promote their stuff and for the list creator to have a big project with a few thousand stars on their profile. So when I did something in, say, Electron, I would go to the awesome-electron list and add it there for promotion's sake.

I couldn't find a use case for these lists myself yet. There is no way to verify the quality of a listed project or its activity (stars, for example? last-commit date?).
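The activity signals mentioned here (stars, last-commit date) are exposed by GitHub's public repository API, so a list could in principle surface them automatically. A minimal sketch in Python: the `stargazers_count` and `pushed_at` fields match GitHub's real `GET /repos/{owner}/{repo}` response, but the `repo_health` helper and the one-year staleness threshold are my own invention for illustration:

```python
from datetime import datetime, timezone

# Arbitrary assumption: a repo untouched for over a year counts as stale.
STALE_AFTER_DAYS = 365

def repo_health(meta, now=None):
    """Summarize a repo's activity from GitHub API repo metadata."""
    now = now or datetime.now(timezone.utc)
    pushed = datetime.strptime(meta["pushed_at"], "%Y-%m-%dT%H:%M:%SZ")
    pushed = pushed.replace(tzinfo=timezone.utc)
    idle_days = (now - pushed).days
    return {
        "stars": meta["stargazers_count"],
        "idle_days": idle_days,
        "stale": idle_days > STALE_AFTER_DAYS,
    }

# Example with canned metadata; a live check would fetch
# https://api.github.com/repos/<owner>/<repo> instead.
meta = {"stargazers_count": 1200, "pushed_at": "2016-01-01T00:00:00Z"}
print(repo_health(meta, now=datetime(2016, 2, 1, tzinfo=timezone.utc)))
# → {'stars': 1200, 'idle_days': 31, 'stale': False}
```

An awesome-list CI job could run this for every linked repo and flag stale entries in the README, which would address the verification gap without relying on each maintainer to check by hand.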

In one case I searched for AWS adapters for a language and clicked every link inside awesome-{{language}}, only to find that all of them were either inactive or just a few days old. I ended up using something I found on Google instead.

[+] elcapitan | 10 years ago
When I started learning Golang last year, I found the awesome-go list quite helpful. Not because I needed every single library mentioned, but because it quickly gave me an impression of how the ecosystem fits together and what the typical ways of building things are. That's quite important when you're not starting from an existing framework (.NET or Rails, etc.).
[+] detaro | 10 years ago
It really depends on how well they are maintained and how selective they are. (E.g. a list with a few curated links in each of many categories is often more useful than one that lists everything that could possibly fit in a category.) Some basic information for each entry is really needed, too.

GitHub offers a reasonable way to manage contributions to them, compared to many other solutions: it is easy for external contributors to suggest or fix something, but the owner can act as a gatekeeper. This is something many link aggregators and bookmarking sites lack.

One example that isn't perfect, but that I found interesting, is https://github.com/Kickball/awesome-selfhosted It has a sentence about each project, license information, and it tries to purge unmaintained projects.

[+] 7952 | 10 years ago
I think the benefit is as a source of inspiration. Finding data can require a lot of domain knowledge which students especially tend to lack. The problem is that the lists are always incomplete and other data may be more appropriate for a particular use case.

Also, quality is more complex than just stars or activity. You need to know what the advantages or limitations are and how that applies to your project.

[+] Kluny | 10 years ago
I'm quite keen to start learning about data science. A bunch of big datasets like this is exactly what I need.
[+] minimaxir | 10 years ago
69 points, #3 on Hacker News, and no comments? :P

This list would be much improved with a description for each dataset and an indication of its schema, as some of the listed datasets have very unfriendly schemas (e.g. the IMDb interfaces link).

Kaggle's recently released Public Datasets feature (https://www.kaggle.com/datasets) takes an interesting approach to presenting data and qualifying datasets, with good examples of data robustness.

[+] qume | 10 years ago
Exactly my thought as soon as I looked at the lack of comments.

How is it that this community can debate inconsequential nonsense at length, yet there is no discussion here of how we could get a consistent set of metadata for these data sources?

There are researchers in both academia and the commercial world who would thrive if there were such a list with good, consistent metadata on how to interact with each source.

Disclosure: I work regularly with open datasets, and the effort it takes to work with each different set overshadows any effort on actual analysis.
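To make the consistent-metadata idea concrete, here is one possible shape for a per-dataset record, sketched in Python. The field names here are hypothetical, my own invention rather than any existing standard (vocabularies like W3C's DCAT cover similar ground); the point is only to illustrate the kind of record a list could require for every entry:

```python
from dataclasses import dataclass, asdict

@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata record for one dataset entry."""
    name: str
    url: str
    license: str          # e.g. "CC0-1.0", "OGL-UK-3.0"
    file_format: str      # e.g. "csv", "json", "sqlite"
    schema_url: str       # where the column/field definitions live
    last_updated: str     # ISO 8601 date

    def validate(self):
        # Reject entries with any empty field, so the list stays consistent.
        missing = [k for k, v in asdict(self).items() if not v]
        if missing:
            raise ValueError(f"missing metadata fields: {missing}")

rec = DatasetRecord(
    name="UK road safety data",
    url="http://www.data.gov.uk",
    license="OGL-UK-3.0",
    file_format="csv",
    schema_url="http://example.org/schema",  # placeholder
    last_updated="2016-02-01",
)
rec.validate()  # raises ValueError if any field is empty
print(rec.name)
```

With a record like this enforced at contribution time (e.g. by a CI check on pull requests), the per-dataset ramp-up cost the parent describes would drop, since license, format, and schema location would be answered up front.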

[+] mistermann | 10 years ago
> 69 points, #3 on Hacker News, and no comments?

I expect this is because HN offers no way (on articles and comments) to differentiate between bookmarking and upvoting; most likely the majority of votes are for bookmarking purposes. Very often I want to upvote someone for a good comment, but I do that sparingly now, because I try to keep my upvoted-comments list minimal so that when I look for something noteworthy I don't have to wade through pages of "liked" comments.

[+] davecap1 | 10 years ago
SolveBio (my startup) has parsed, normalized, and indexed a number of the datasets listed under biology. Our goal is to make these kinds of datasets easier to access for programmers and non-programmers alike, similar to some other sites mentioned here (Enigma and Quandl), but for genomics. You can query and filter the data on the website or through one of our API clients: https://www.solvebio.com/library
[+] lap88 | 10 years ago
Sounds like a good idea, especially the normalization part, but your site requires JavaScript, with not even basic functionality available without it... nope.
[+] chestnut-tree | 10 years ago
For those in the UK, the available government datasets are published at http://www.data.gov.uk

The datasets are not public domain, but licensed under the Open Government Licence (which allows you to use and adapt the data for commercial use).

There's also the Global Open Data Index: a website that ranks countries by how much government data is available as open datasets, based on certain criteria. The current top spot is taken by Taiwan:

  1. Taiwan
  2. UK
  3. Denmark
  4. Colombia
  5. Finland
  5. Australia
  7. Uruguay
  8. USA
  8. Netherlands
  10. Norway
  10. France
http://index.okfn.org/place/
[+] psykovsky | 10 years ago
You mean Colombia?
[+] yzh | 10 years ago
For the complex network part, I think the collection missed this one: http://www.networkrepository.com/ The site itself is a collection of several publicly available network datasets.
[+] patrickk | 10 years ago
Betfair Historical Exchange Data requires you to have "100 Betfair points" which you acquire by gambling on their site. It's hardly an open dataset.
[+] Spooky23 | 10 years ago
Check out data.ny.gov

Also nycopendata.socrata.com

[+] lifeisstillgood | 10 years ago
Is it too late to create a central registry of datasets, to aid discoverability? A voluntary system maintained by convention?

Perhaps a distributed registration system à la DNS?

[+] tylercubell | 10 years ago
Enigma.io is great for public data too.
[+] minimaxir | 10 years ago
I took another look at the Enigma.io public datasets. Over 50% of all the public datasets are from the Federal Reserve Bank of St. Louis. Finance data is boring. :P

Quandl (https://www.quandl.com/browse) is similar to Enigma, except they got rid of all the fun datasets and added more finance/economic datasets. Hmrph.

[+] legulere | 10 years ago
It's strange that they put Wikidata under natural language.