I don't quite understand these awesome lists. From what I've seen, they usually end up being a way for creators to promote their stuff and for the list creator to have a big project with a few thousand stars on their profile. So when I did something in, say, Electron, I would go to the awesome-electron list and add it there for promotion's sake.
I couldn't find a use case for these lists myself yet. There is no way to verify the quality of a product or its activity (stars, for example? last commit date?).
In one case I searched for AWS adapters for a language and clicked on all the links inside awesome-{{language}}, only to find that all of them were inactive or a few days old. I ended up using something I found on Google instead.
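One partial workaround for the activity question (a sketch of my own, not something any of these lists provide): GitHub's public REST API exposes star counts and last-push timestamps, so basic staleness checks can be scripted. The repo name in the example is a placeholder.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def repo_stats(full_name):
    """Fetch basic activity signals for a GitHub repo via the public REST API."""
    url = f"https://api.github.com/repos/{full_name}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return {"stars": data["stargazers_count"], "pushed_at": data["pushed_at"]}

def looks_stale(pushed_at, max_age_days=365, now=None):
    """True if the last push (an ISO-8601 'Z' timestamp) is older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    last = datetime.strptime(pushed_at, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (now - last) > timedelta(days=max_age_days)

# Example (placeholder repo name; requires network access):
# stats = repo_stats("someuser/some-awesome-entry")
# print(stats["stars"], looks_stale(stats["pushed_at"]))
```

Run over every link in a list, something like this would at least separate long-dead entries from active ones, even if it says nothing about quality.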
When I started learning Golang last year, I found the awesome-go list quite helpful. Not because I needed every single library mentioned, but because it quickly gave me an impression of how the ecosystem plays out and what the typical ways to build things are. That's quite important when you're not starting from an existing framework (.NET, Rails, etc.).
It really depends on how well they are maintained and how selective they are (e.g., a list covering many categories with a few chosen links each is often more useful than one that lists everything that could possibly fit into a category). And some basic information for each entry is really needed.
GitHub offers a reasonable way to manage contributions to them, compared to many other solutions. It is easy for external contributors to suggest or fix something, while the owner can act as a gatekeeper. This is something many link aggregators and bookmarking sites lack.
One example that isn't perfect, but that I found interesting, is https://github.com/Kickball/awesome-selfhosted. It has a sentence about each project, includes license information, and tries to purge unmaintained projects.
I think the benefit is as a source of inspiration. Finding data can require a lot of domain knowledge which students especially tend to lack. The problem is that the lists are always incomplete and other data may be more appropriate for a particular use case.
Also, quality is more complex than just stars or activity. You need to know what the advantages or limitations are and how that applies to your project.
This list would be much improved with a description for each dataset and an indication of its schema, as some of the datasets listed have very unfriendly schemas (e.g. the IMDb interfaces link).
Kaggle's recently released Public Datasets feature (https://www.kaggle.com/datasets) provides an interesting approach to presenting data and qualifying datasets by giving good examples of data robustness.
Exactly my thought as soon as I saw the lack of comments.
How is it that this community can debate inconsequential nonsense, yet there is no discussion here of how we get a consistent set of metadata for these data sources?
There are researchers in both academia and the commercial world who would thrive if there were such a list with good, consistent metadata on how to interact with each source.
Disclosure: I work regularly with open datasets, and the effort it takes to work with each different set overshadows any effort on the actual analysis.
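As a concrete sketch of what "consistent metadata" could look like per entry (the field names are my own invention, loosely inspired by catalog vocabularies such as DCAT, not taken from any list in the thread):

```python
# A hypothetical metadata record for one dataset entry; the fields are
# illustrative, not drawn from any existing standard the thread mentions.
dataset_entry = {
    "title": "Example Dataset",
    "description": "One-sentence summary of what the data contains.",
    "url": "https://example.org/data.csv",
    "license": "CC0-1.0",
    "format": "csv",
    "schema": ["id", "timestamp", "value"],
    "last_updated": "2015-11-01",
}

# Fields a curated list could require before accepting a submission.
REQUIRED = {"title", "description", "url", "license", "format", "last_updated"}

def missing_fields(entry):
    """Return the required metadata fields absent from an entry, sorted."""
    return sorted(REQUIRED - entry.keys())
```

A list maintainer could then reject pull requests whose entries report any missing fields, which is cheap to enforce automatically.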
I expect this is because there is no way on HN to differentiate (on articles and comments) between bookmarking and upvoting; most likely, the majority of votes are for bookmarking purposes. Very often I want to upvote someone for a good comment, but I do that very sparingly now because I try to keep my upvoted-comments list minimal, so that when I try to find something noteworthy I don't have to wade through pages of "liked" comments.
SolveBio (my startup) has parsed, normalized, and indexed a number of the datasets listed under biology. Our goal is to make these kinds of datasets easier to access for programmers and non-programmers alike, similar to some other sites mentioned here (Enigma and Quandl) but for genomics. You can query and filter the data on the website or through one of our API clients: https://www.solvebio.com/library
Sounds like a good idea, especially the normalization part, but your site requires JavaScript and offers no basic functionality without it... nope.
The author's notion of a "dataset" is weird.
Under "Finance", there's a link to the Google Finance page (http://finance.google.com/). How is that a "dataset"?
For those in the UK, the available government datasets are published at http://www.data.gov.uk
The datasets are not public domain, but licensed under the Open Government Licence (which allows you to use and adapt the data for commercial use).
There's also the Global Open Data Index (http://index.okfn.org/place/), a website that ranks countries by how much government data is available as open datasets, based on certain criteria. The current top spot is taken by Taiwan:
1. Taiwan
2. UK
3. Denmark
4. Colombia
5. Finland
5. Australia
7. Uruguay
8. USA
8. Netherlands
10. Norway
10. France
I wish linkeddata.org or CKAN installs weren't being reinvented here, and that CKAN instead supported pull requests or similar decentralized ways to publish new datasets.
For the complex-network part, I think the collection missed this one: http://www.networkrepository.com/ (the site is itself a collection of several publicly available network datasets).
We have https://www.biodiversitycatalogue.org/ for biodiversity informatics APIs. A hackathon I attended made an API for registering, but I don't think it was deployed.
I took another look at the Enigma.io public datasets. Over 50% of all the public datasets are from the Federal Reserve Bank of St. Louis. Finance data is boring. :P
Quandl (https://www.quandl.com/browse) is similar to Enigma, except they got rid of all the fun datasets and added more finance/economic datasets. Hmrph.
I don't quite understand the criteria for being included in the list, since I think it's:
https://groups.google.com/forum/#!forum/awesomepublicdataset...
Also nycopendata.socrata.com
Perhaps a distributed registration system à la DNS?