item 22890801

Cleaning algorithm finds 20% of errors in major image recognition datasets

221 points | groar | 6 years ago | deepomatic.com

72 comments

[+] CydeWeys|6 years ago|reply
Why aren't these data sets editable instead of static? Treat them like a collaborative wiki or something (OpenStreetMap being the closest fit) and allow everyone to submit improvements so that all may benefit.

I hope the people in this article had a way to contribute back their improvements, and did so.

[+] 6gvONxR4sf7o|6 years ago|reply
The datasets serve as benchmarks. You get an idea for a new model that solves a problem current models have. Most of these ideas don't pan out, so you need empirical evidence that yours works. To show that your model does better than previous models, you need some task that your model and previous models can share for both training and evaluation. It's more complicated than that, but that's the gist.

It would be so wasteful to have to retrain a dozen models that require a month of GPU time each just to serve as baselines for your new model...

[+] polm23|6 years ago|reply
Multiple reasons, but to name a few:

- Don't want to deal with vandalism

- Hosting static data is dramatically easier than making a public editing interface

- You want reference versions of the dataset for papers to refer to so that results are comparable. Sometimes this is used as a justification for not fixing completely broken data, like with fastText.

https://github.com/facebookresearch/fastText/issues/710

- Building on the previous point, large datasets like this don't play nice with Git. There are lots of "git for data" things but none of them are very mature, and most people don't spend time trying to figure something out.

[+] seveibar|6 years ago|reply
I'm working on this[1]; my theory is that the lack of a good IDE (rather than a simple crowdsourcing interface) is the reason it hasn't been done.

Imagine if GitHub had an integrated IDE for editing large datasets. Also see Dolt, which is doing good work here.

[1] https://github.com/UniversalDataTool/universal-data-tool

[+] lmkg|6 years ago|reply
One major use of the public datasets in the academic community is to serve as a common reference when comparing new techniques against the existing standard. A static baseline is desirable for this task.

You could maybe split the difference by having an "original" or "reference" version, and a separate moving target that incorporates crowdsourced improvements.

[+] xiphias2|6 years ago|reply
One problem with correcting the benchmark datasets is that it's important for the algorithms to be robust to labelling errors as well. But having multiple versions sounds important anyways.
[+] andreyk|6 years ago|reply
Some datasets are indeed like this, see e.g. Common Voice - https://voice.mozilla.org/en

In general these things are open source, so you can always contribute an improved version of the dataset. But as another commenter said having relatively static ones is also important for benchmarking purposes.

[+] rathel|6 years ago|reply
Nothing is said, however, about how the errors are detected. Can an ML expert chime in?
[+] thibaut-duguet|6 years ago|reply
I'm a Product Manager at Deepomatic and I have been leading the study in question here. To detect the errors, we trained a model (with a different neural network architecture than the 6 listed in the post), and we then have a matching algorithm that highlights all bounding boxes that were either annotated but not predicted (False Negatives), or predicted but not annotated (False Positives). Those potential errors are also sorted by an error score so that the most obvious errors come first. Happy to answer any other questions you may have!
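The matching step described above can be sketched roughly like this: compare the model's predicted boxes against the dataset's annotated boxes via IoU, and flag unmatched boxes on either side as potential errors. This is a minimal illustration, not Deepomatic's actual implementation; the box format and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def spot_errors(annotated, predicted, threshold=0.5):
    """Return (false_negatives, false_positives): boxes annotated but not
    predicted, and boxes predicted but not annotated."""
    fn = [a for a in annotated if all(iou(a, p) < threshold for p in predicted)]
    fp = [p for p in predicted if all(iou(a, p) < threshold for a in annotated)]
    return fn, fp
```

The sorting-by-error-score part would then rank `fp` boxes by the model's confidence, so a human reviews the most glaring mismatches first.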
[+] ArnoVW|6 years ago|reply
My guess would be some sort of active learning. In other words:

1) Build a model using the data set

2) Make predictions on the training data

3) Find the cases where the model is most confused (the difference in probability between classes is low)

4) Raise those cases to humans

https://en.wikipedia.org/wiki/Active_learning_(machine_learn...
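Step 3 above is often done by "margin" uncertainty sampling: flag examples where the gap between the top two class probabilities is small. A minimal sketch, with the function name and the 0.1 margin threshold as illustrative assumptions:

```python
def confused_examples(probabilities, margin=0.1):
    """Given per-example class probability lists, return the indices of
    examples the model is least sure about."""
    flagged = []
    for i, probs in enumerate(probabilities):
        top_two = sorted(probs, reverse=True)[:2]
        # Small gap between the two most likely classes = model is confused.
        if top_two[0] - top_two[1] < margin:
            flagged.append(i)  # raise this case to a human annotator
    return flagged
```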

[+] captain_price7|6 years ago|reply
Plus we'll have to register just to see a few examples of mislabeling... that was disappointing.
[+] kent17|6 years ago|reply
20% annotation error is huge, especially since those datasets (COCO, VOC) are used for basically every benchmark and state of the art research.
[+] rndgermandude|6 years ago|reply
And people wonder why I am still a bit skeptical of self-driving cars....
[+] peteradio|6 years ago|reply
Is it really a 20% annotation error rate? I read it as 20% of the errors being detected. Errors could be some very small percentage, and of those, 20% were detected.
[+] magicalhippo|6 years ago|reply
> Create an account on the Deepomatic platform with the voucher code “SPOT ERRORS” to visualize the detected errors.

Nice ad.

[+] thibaut-duguet|6 years ago|reply
Our platform is actually designed for enterprise companies, so we don't provide open access unfortunately.
[+] fwip|6 years ago|reply
The title here seems wrong. Suggested change:

"Cleaning algorithm finds 20% of errors in major image recognition datasets" -> "Cleaning algorithm finds errors in 20% of annotations in major image recognition datasets."

We don't know if the found errors represent 20%, 90% or 2% of the total errors in the dataset.

[+] groar|6 years ago|reply
Yes, agreed with that! I can't change the title, unfortunately.
[+] kent17|6 years ago|reply
> We then used the error spotting tool on the Deepomatic platform to detect errors and to correct them.

I'm wondering if those errors are selected on how much they impact the performance?

Anyway, this is probably a much better way of gaining accuracy on the cheap than launching 100+ models for hyperparameter tuning.

[+] frenchie4111|6 years ago|reply
Best I can tell, they are using the ML model to detect the errors. Isn't this a bit of an ouroboros? The model will naturally get better, because you are only correcting problems where it was right but the label was wrong.

It's not necessarily a representation of a better model, but just of a better testing set.

[+] groar|6 years ago|reply
If I understand correctly they actually did not change the test set.
[+] benibela|6 years ago|reply
These things are why I stopped doing computer vision after my master's thesis.
[+] jontro|6 years ago|reply
Weird behaviour on pinch-to-zoom (MacBook). It scrolls instead of zooming, and when swiping back nothing happens.

Another example of why you should never mess with the defaults unless strictly necessary.

[+] groar|6 years ago|reply
Using simple techniques, they found out that popular open source datasets like VOC or COCO contain up to 20% annotation errors. By manually correcting those errors, they got an average error reduction of 5% for state-of-the-art computer vision models.
[+] m0zg|6 years ago|reply
An idea on how this could work: repeatedly re-split the dataset (to cover all of it) and re-train a detector on the splits; at the end of each training cycle, surface the validation frames with the highest computed loss (or some other metric more directly derived from bounding boxes, such as the number of high-confidence "false" positives, which could be instances of under-labeling). That's what I do on noisy, non-academic datasets, anyway.
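The re-splitting idea above amounts to K-fold cross-validation where every frame lands in a validation split exactly once, then ranking frames by held-out loss. A rough sketch, where `train` and `loss` are placeholders for your detector's training routine and per-frame loss, not a real API:

```python
def surface_suspect_frames(frames, train, loss, k=5, top_n=100):
    """Train k detectors, each validated on a disjoint fold, and return the
    top_n frames with the highest held-out loss for human review."""
    folds = [frames[i::k] for i in range(k)]  # every frame in exactly one fold
    scored = []
    for i, held_out in enumerate(folds):
        train_set = [f for j, fold in enumerate(folds) if j != i for f in fold]
        model = train(train_set)
        scored.extend((loss(model, f), f) for f in held_out)
    scored.sort(key=lambda t: t[0], reverse=True)  # highest loss first
    return [f for _, f in scored[:top_n]]
```

Frames where a model trained without them disagrees strongly with the label are exactly the candidates for under- or mislabeling.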