Why aren't these data sets editable instead of static? Treat them like a collaborative wiki or something (OpenStreetMap being the closest fit) and allow everyone to submit improvements so that all may benefit.
I hope the people in this article had a way to contribute back their improvements, and did so.
The datasets serve as benchmarks. You get an idea for a new model that solves a problem current models have. Most such ideas don't pan out, so you need empirical evidence that yours works. To show that your model does better than previous models, you need some task that your model and the previous models can share for both training and evaluation. It's more complicated than that, but that's the gist.
It would be so wasteful to have to retrain a dozen baseline models, each requiring a month of GPU time, just to compare them against your new model...
One major use of the public datasets in the academic community is to serve as a common reference when comparing new techniques against the existing standard. A static baseline is desirable for this task.
You could maybe split the difference by having an "original" or "reference" version, and a separate moving target that incorporates crowdsourced improvements.
One problem with correcting benchmark datasets is that the algorithms also need to be robust to labelling errors. Still, having multiple versions sounds important anyway.
In general these things are open source, so you can always contribute an improved version of the dataset. But as another commenter said, having relatively static ones is also important for benchmarking purposes.
I'm a Product Manager at Deepomatic, and I led the study in question here. To detect the errors, we trained a model (with a different neural network architecture from the 6 listed in the post), and we then ran a matching algorithm that highlights all bounding boxes that were either annotated but not predicted (false negatives) or predicted but not annotated (false positives). Those potential errors are also sorted by an error score, so the most obvious errors surface first. Happy to answer any other questions you may have!
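A minimal sketch of that kind of matching follows. The IoU threshold of 0.5 and the use of prediction confidence as the error score are assumptions for illustration, not details from the study:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def flag_errors(annotations, predictions, iou_thresh=0.5):
    """Return (false_negatives, false_positives).

    A ground-truth box with no matching prediction is a potential
    false negative (annotated but not predicted); a predicted box that
    matches no ground-truth box is a potential false positive
    (predicted but not annotated), sorted by confidence so the most
    "obvious" candidate errors come first.
    """
    false_negatives = [
        a for a in annotations
        if not any(iou(a, p["box"]) >= iou_thresh for p in predictions)
    ]
    false_positives = sorted(
        (p for p in predictions
         if not any(iou(p["box"], a) >= iou_thresh for a in annotations)),
        key=lambda p: -p["score"],
    )
    return false_negatives, false_positives
```

In practice the error score could also factor in class agreement or localization quality; confidence alone is just the simplest ranking.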
My guess would be some sort of active learning. In other words:
1) building a model using the data set
2) making predictions on the training data
3) finding the cases where the model is most confused (the difference in probability between the top classes is low)
4) raising those cases to humans
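Step 3 above can be sketched with a margin-based uncertainty score; the function name and the top-k selection are illustrative, not from the article:

```python
import numpy as np

def lowest_margin_examples(probs, k=3):
    """Rank examples by the margin between their top two class probabilities.

    probs: (n_examples, n_classes) array of predicted class probabilities.
    Returns the indices of the k most "confused" examples (smallest margin),
    i.e. the ones that would be surfaced to human annotators for review.
    """
    sorted_probs = np.sort(probs, axis=1)            # ascending per row
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    return np.argsort(margin)[:k]
```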
Is it really a 20% annotation error rate? I read it as 20% of the errors being detected: the errors could be some very small percentage of annotations, and of those, 20% were detected.
"Cleaning algorithm finds 20% of errors in major image recognition datasets" -> "Cleaning algorithm finds errors in 20% of annotations in major image recognition datasets."
We don't know if the found errors represent 20%, 90% or 2% of the total errors in the dataset.
Best I can tell, they are using the ML model to detect the errors. Isn't this a bit of an ouroboros? The model will naturally get better, because you are only correcting problems where it was right but the label was wrong.
It's not necessarily a representation of a better model, but just of a better testing set.
Using simple techniques, they found that popular open source datasets like VOC or COCO contain up to 20% annotation errors. By manually correcting those errors, they got an average error reduction of 5% for state-of-the-art computer vision models.
An idea for how this could work: repeatedly re-split the dataset (so as to cover all of it) and re-train a detector on each split; then, at the end of each training cycle, surface the validation frames with the highest computed loss (or some metric derived more directly from the bounding boxes, such as the number of high-confidence "false" positives, which could be instances of under-labeling). That's what I do on noisy, non-academic datasets, anyway.
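A rough skeleton of that re-split loop, with `train_and_score` as a hypothetical callback standing in for the actual detector training and per-frame loss computation:

```python
import random

def surface_suspects(frames, train_and_score, n_folds=5, top_k=10, seed=0):
    """Cross-validation sweep for label-error candidates.

    Shuffle the dataset and split it into n_folds. For each fold, train a
    detector on the remaining folds (via the caller-supplied train_and_score)
    and score the held-out frames, so every frame is scored exactly once by
    a model that never trained on it. Return the top_k highest-loss frames
    as candidates for human review.
    """
    frames = list(frames)
    rng = random.Random(seed)
    rng.shuffle(frames)
    folds = [frames[i::n_folds] for i in range(n_folds)]
    scored = []
    for i, held_out in enumerate(folds):
        train = [f for j, fold in enumerate(folds) if j != i for f in fold]
        # train_and_score(train, held_out) -> list of per-frame losses,
        # aligned with held_out; this is where the detector training goes.
        losses = train_and_score(train, held_out)
        scored.extend(zip(losses, held_out))
    scored.sort(key=lambda t: -t[0])
    return [frame for _, frame in scored[:top_k]]
```

The loss here could equally be the count of high-confidence unmatched detections per frame, as suggested above; only the ranking matters.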
polm23 | 6 years ago
- Don't want to deal with vandalism
- Hosting static data is dramatically easier than making a public editing interface
- You want reference versions of the dataset for papers to refer to, so that results are comparable. Sometimes this is used as a justification for not fixing completely broken data, like with fastText:
https://github.com/facebookresearch/fastText/issues/710
- Building on the previous point, large datasets like this don't play nicely with Git. There are lots of "git for data" tools, but none of them are very mature, and most people don't bother figuring one out.
seveibar | 6 years ago
Imagine if GitHub had an integrated IDE for editing large datasets [1]. Also see Dolt, which is doing good work here.
[1] https://github.com/UniversalDataTool/universal-data-tool
ArnoVW | 6 years ago
https://en.wikipedia.org/wiki/Active_learning_(machine_learn...
magicalhippo | 6 years ago
Nice ad.
kent17 | 6 years ago
I'm wondering if those errors are selected based on how much they impact performance?
Anyway, this is probably a much cheaper way of gaining accuracy than launching 100+ training runs for hyperparameter tuning.
jontro | 6 years ago
Another example of why you should never mess with the defaults unless strictly necessary.