We benchmark on pre-2023 datasets of O(10M) documents not in our training set. Other detectors seem to have false positive rates of 1-3%, while ours is around 1 in 10,000 as of our latest model update. We rely heavily on active learning and core-set selection to keep the FPR low and to improve recall on larger LLMs. Our white paper with some of the methodology is here: https://arxiv.org/abs/2402.14873
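(For anyone curious what "core-set selection" refers to: one common variant is greedy k-center over document embeddings, which picks a small labeling batch that covers the embedding space. A minimal sketch below; this is just the textbook algorithm, not necessarily what this team's pipeline does.)

```python
import math
import random

def greedy_k_center(points, k, seed=0):
    """Greedy k-center core-set selection: start from a random point,
    then repeatedly add the point farthest from the selected set.
    This is the classic 2-approximation used in core-set active learning."""
    rng = random.Random(seed)
    n = len(points)
    selected = [rng.randrange(n)]
    # dist[i] = distance from point i to its nearest selected point
    dist = [math.dist(p, points[selected[0]]) for p in points]
    while len(selected) < k:
        nxt = max(range(n), key=lambda i: dist[i])  # farthest-point choice
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < dist[i]:
                dist[i] = d
    return selected

# toy usage: 200 random 8-d "document embeddings", choose 10 to label next
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
idx = greedy_k_center(X, 10)
print(len(idx))
```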
Wheatman|1 year ago
I noticed something: you achieved near-100% accuracy in every domain except scientific, which made me wonder how much of that could be due to how "strict" and "professional" these papers tend to be, or maybe that a slightly disproportionate share of the training data for these LLMs comes from science-based articles and papers, since they're generally viewed as "high quality".
Interesting read either way, best of luck on your project (: