I think I understand: you basically need your training set to contain anything the production model would presumably see for it to work well? You can't just say "here are a bunch of positives and a bunch of negatives"; the negatives have to be actual things the model will see.
sanxiyn|5 years ago
This long penis is A. This short penis is A. This cat is B. This dog is B. Now, what is this ceiling?
The model, looking at the ceiling, discovers a long fluorescent tube. It is long! Neither cat nor dog is long (the model has yet to discover longcat), and while penises come in long and short varieties, all penises seem long-ish. The ceiling is A.
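The joke illustrates a real failure mode: a closed-set classifier must place every input into one of its known classes, so a completely novel input still gets a confident answer based on whatever spurious feature happens to correlate. A minimal sketch with toy 2-D data (the data, labels, and geometry here are illustrative, not from the thread):

```python
# A closed-set classifier has no "none of the above" option: the novel
# "ceiling" input still gets assigned to class A or B, often confidently.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
class_a = rng.normal(loc=[3, 0], scale=0.5, size=(100, 2))   # toy "A" examples
class_b = rng.normal(loc=[-3, 0], scale=0.5, size=(100, 2))  # toy "B" examples
X = np.vstack([class_a, class_b])
y = np.array(["A"] * 100 + ["B"] * 100)

clf = LogisticRegression().fit(X, y)

# A point far from everything seen in training -- the "ceiling".
ceiling = np.array([[10.0, 10.0]])
label = clf.predict(ceiling)[0]
confidence = clf.predict_proba(ceiling).max()
print(label, confidence)  # one of A/B, typically with high confidence
```

The distance from the decision boundary only grows for far-away points, so softmax/sigmoid confidence is often *highest* exactly where the model knows least.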
jononor|5 years ago
Adding common inputs to the training (or at least validation and test) sets is a good solution. It's hard data work, but will pay off. There are some techniques outside of closed-set classification that can help reduce the problems, or make the process of improving it more effective:
- Couple the classifier with an out-of-distribution (novelty/anomaly) detector. Samples that score high are considered "Unknown" and can be flagged for review.
- Learn a distance metric for "nudity" instead of a classifier, potentially with unsupervised or self-supervised learning (no labels needed). This has a higher chance of doing well on novel examples, but it still needs to be validated/monitored.
- Use a one-class classifier, trained only on positive samples of nudity. This has the disadvantage that novel nudity is very likely to be classified as "not nudity", which could be an issue.
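The first option above can be sketched in a few lines: run a novelty detector on the same inputs the classifier trained on, and route flagged samples to an "unknown" bucket instead of trusting the classifier. This is a minimal sketch using scikit-learn's `IsolationForest` as the detector; the data, threshold convention, and function name are illustrative assumptions, not something from the thread:

```python
# Coupling a classifier with an out-of-distribution detector: samples the
# detector flags as outliers are returned as "unknown" for human review.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy 2-D stand-in for two in-distribution classes (e.g. nudity / not nudity).
X_pos = rng.normal(loc=[2, 2], scale=0.5, size=(200, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(200, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 200 + [0] * 200)

clf = LogisticRegression().fit(X, y)
# The detector sees the same training inputs as the classifier.
detector = IsolationForest(random_state=0).fit(X)

def predict_with_unknown(x):
    """Return 'unknown' for flagged samples, otherwise the class label."""
    if detector.predict(x)[0] == -1:  # -1 means outlier in scikit-learn
        return "unknown"
    return int(clf.predict(x)[0])

in_dist = np.array([[2.0, 2.0]])   # near a training cluster
novel = np.array([[8.0, -8.0]])    # far from anything seen in training
print(predict_with_unknown(in_dist), predict_with_unknown(novel))
```

Any novelty detector with a score and a threshold (`OneClassSVM`, `LocalOutlierFactor` with `novelty=True`, a reconstruction-error autoencoder) slots into the same pattern; the key design point is that the "unknown" decision happens *before* the closed-set classifier is consulted.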
sanxiyn|5 years ago