
ghm2180 | 3 months ago

I would offer a stronger, more pointed observation: often the problem in building a good classifier is having good negative examples. More generally, how well a classifier identifies good negatives is a function of:

1. Data collection technique.

2. Data annotation (labelling).

3. Whether the classifier can actually learn from your "good" negatives, quantitatively a function of the residual/margin/contrastive/triplet losses; i.e. it learns the difference between a negative and a positive at train time, and the optimization minimum it reaches there still holds up at test time.

4. Calibration/Reranking and other Post Processing.

My guess is that they hit a sweet spot with the first 3 techniques.
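The margin/triplet idea in point 3 can be made concrete. Here is a minimal, hypothetical NumPy sketch of a triplet loss: the margin term is what forces a negative to sit at least `margin` farther from the anchor than the positive does, so "hard" negatives are exactly the ones that contribute loss at train time:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the anchor toward the positive and push it away from the
    negative until they are separated by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to same-class example
    d_neg = np.linalg.norm(anchor - negative)  # distance to the negative
    return max(d_pos - d_neg + margin, 0.0)

# A well-separated (easy) negative incurs zero loss; a hard negative,
# one that sits close to the anchor, is the kind that actually teaches
# the model the positive/negative boundary.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])       # close positive
n_easy = np.array([5.0, 0.0])  # far negative
n_hard = np.array([0.2, 0.0])  # hard negative

print(triplet_loss(a, p, n_easy))  # 0.0
print(triplet_loss(a, p, n_hard))  # 0.9
```

This is why the quality of negatives (points 1 and 2 above) matters so much: a dataset full of easy negatives contributes almost no gradient signal.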


jacquesm|3 months ago

I think the biggest problem with such classifiers is actually knowing what is good data and what is bad data: taking a sample of the data and recognizing whether or not that dataset is a general enough representation of both true and false examples (for a binary classifier) to be usable for training a model. It isn't rare at all to have datasets that are biased 100 to 1 or more toward one of the classes, or that contain hints about an object's class that aren't in the object itself, and so on. You can train until the cows come home on such data, but it will never lead to satisfactory results.

ghm2180|3 months ago

So the bias issue can be handled in a variety of ways; one which I know to work is to put weights on your rarer class when training. You could also use larger margins to make sure you definitely don't misclassify the rare class, at the cost of mislabelling some of your dominant class, presuming you are OK with that trade-off. An example is breast biopsies: doctors order them far more often than cancer actually occurs, based on the noisy technique of a physical exam.
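The class-weighting idea can be sketched with scikit-learn, whose `class_weight="balanced"` option reweights each class inversely to its frequency. The data below is synthetic and hypothetical, chosen only to mimic a heavy (roughly 100:1) skew like the biopsy example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: ~100:1 dominant-to-rare class ratio.
n_neg, n_pos = 2000, 20
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_neg, 2)),  # dominant class
    rng.normal(loc=2.0, scale=1.0, size=(n_pos, 2)),  # rare class
])
y = np.array([0] * n_neg + [1] * n_pos)

# "balanced" makes each rare-class mistake cost ~100x more in the loss,
# shifting the decision boundary toward the dominant class.
weighted = LogisticRegression(class_weight="balanced").fit(X, y)
plain = LogisticRegression().fit(X, y)

rare = X[y == 1]
print("rare-class recall, weighted:  ", weighted.predict(rare).mean())
print("rare-class recall, unweighted:", plain.predict(rare).mean())
```

The weighted model recovers far more of the rare class, at the cost of more false positives on the dominant one, which is exactly the trade-off described above.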