top | item 28274766


mulcyber | 4 years ago

Hot take: there is no "bad" data.

It's a term we often hear, one that implies there is "good" and "bad" data.

A dataset can have labeling errors, be very small, or be unbalanced, but all of that can be managed with the proper methods.
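As one illustration of "proper methods" for an unbalanced dataset (the function name and toy data are mine, not the commenter's), here is a minimal sketch of naive random oversampling, which duplicates minority-class examples until the classes are balanced:

```python
import random
from collections import Counter

def oversample(examples, labels):
    """Naive random oversampling: duplicate minority-class examples
    until every class matches the majority-class count."""
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        picks = xs + [random.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in picks)
    return balanced

random.seed(1)
# 4 examples of class 0 but only 1 of class 1: an unbalanced toy dataset
data = oversample(["a1", "a2", "a3", "a4", "b1"], [0, 0, 0, 0, 1])
counts = Counter(y for _, y in data)
print(counts)  # both classes now have 4 examples
```

Class weights in the loss or undersampling the majority class are common alternatives; the point is that imbalance alone doesn't make the data "bad".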

THE biggest problem is when your training data does not correspond to the production use-case.

It's not that the dataset is "bad"; it's that the problem your ML algorithm ends up solving when trained on that data does not correspond to the problem you're actually trying to solve.

Take self-driving cars: the most "perfect" ML algorithm trained on the most "perfect" dataset made in the US (for detection, segmentation of objects, or whatever) will have problems when the cars drive in another country. Your MNIST-trained NN will have problems in a country where digits are written slightly differently. Some people will put pictures of cats into your car-model classification software. Pictures taken on a smartphone by your users will be different from your dataset scraped off the web.

There is no bad data, just badly used data. And most of the work (and the most interesting part IMO) in ML is to identify, quantify and neutralize biases in models and differences between the data you have and the data the production system will work with.
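One simple way to quantify a difference between the data you have and the data production sees (a sketch with synthetic numbers; the variable names and the shift of 0.8 are illustrative assumptions) is a two-sample Kolmogorov-Smirnov statistic on a feature, computed here from scratch:

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic:
    the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)

    def ecdf(xs, v):
        # fraction of xs that are <= v (xs is sorted)
        return bisect.bisect_right(xs, v) / len(xs)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in set(a) | set(b))

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(1000)]         # the data you have
prod_ok = [random.gauss(0.0, 1.0) for _ in range(1000)]       # production matches training
prod_shifted = [random.gauss(0.8, 1.0) for _ in range(1000)]  # production has drifted

print(ks_statistic(train, prod_ok))       # small gap: distributions match
print(ks_statistic(train, prod_shifted))  # large gap: covariate shift detected
```

Running a check like this per feature on incoming production data is one cheap way to notice that the "perfect" training set no longer describes the world the model operates in.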
