top | item 46089379


maxspero | 3 months ago

I am not sure if you are familiar with Pangram (co-founder here) but we are a group of research scientists who have made significant progress in this problem space. If your mental model of AI detectors is still GPTZero or the ones that say the declaration of independence is AI, then you probably haven't seen how much better they've gotten.

This paper by economists from the University of Chicago found zero false positives among 1,992 human-written documents and over 99% recall in detecting AI documents. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5407424


nialse|3 months ago

Nothing points out that the benchmark is invalid like a zero false positive rate. Seemingly it is pre-2020 text vs. a few models' reworkings of texts. I can see this model falling apart in many real-world scenarios. Yes, LLMs use strange language when left to their own devices, and this can surely be detected. But a 0% false positive rate under all circumstances? Implausible.

pinkmuffinere|3 months ago

> Nothing points out that the benchmark is invalid like a zero false positive rate

You’re punishing them for claiming to do a good job. If they truly are doing a bad job, surely there is a better criticism you could provide.

bonsai_spool|3 months ago

                          EditLens (Ours)
                   Predicted Label
                Human     Mix       AI
             ┌─────────┬─────────┬─────────┐
       Human │  1770   │   111   │    0    │
             ├─────────┼─────────┼─────────┤
 True  Mix   │   265   │  1945   │   28    │
 Label       ├─────────┼─────────┼─────────┤
         AI  │    0    │   186   │  1695   │
             └─────────┴─────────┴─────────┘

It looks like roughly 5% of human texts are marked as mixed, and 5-10% of the mixed/AI texts are misclassified, according to your own paper.

I guess I don’t see that this is much better than what’s come before, using your own paper.
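To make the complaint concrete, the per-class error rates implied by the matrix above can be computed directly. This is a quick sketch; the counts are copied from the quoted table, and the helper name is just illustrative:

```python
# Per-class misclassification rates from the EditLens confusion matrix above.
# Outer keys are true labels; inner keys are predicted labels.
matrix = {
    "Human": {"Human": 1770, "Mix": 111,  "AI": 0},
    "Mix":   {"Human": 265,  "Mix": 1945, "AI": 28},
    "AI":    {"Human": 0,    "Mix": 186,  "AI": 1695},
}

def misclassification_rate(true_label: str) -> float:
    """Fraction of documents with this true label that got any wrong prediction."""
    row = matrix[true_label]
    return 1 - row[true_label] / sum(row.values())

for label in matrix:
    print(f"{label}: {misclassification_rate(label):.1%} misclassified")
# Human texts: ~5.9% flagged as Mix; Mix texts: ~13.1% wrong; AI texts: ~9.9% softened to Mix.
```

So "zero false positives" holds only for the Human-vs-AI corners; the Mix column absorbs a non-trivial share of errors in both directions.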

Edit: this is an irresponsible Nature news article, too - we should see a graph of this detector over the past ten years to see how much of this ‘deluge’ is algorithmic error

lifthrasiir|3 months ago

It is not wise to brag about your product when the GP is pointing out that the article "reads like PR for Pangram", regardless of whether AI detectors are reliable or not.

glenstein|3 months ago

I would say it's important to hold off on the moralizing until you've shown visible effort to engage with the substance of the exchange, which in this case is about whether it's fair to assume the detection methodology used here shares the flaws of the familiar online AI checkers. That's a substantive, rebuttable point, and all the meaningful action in the conversation is embedded in those details.

In this case, several important distinctions are drawn: being open about criteria, about properties like "perplexity" and "burstiness" being tested for, and an explanation of why earlier detectors incorrectly flag the Declaration of Independence as AI-generated (it's ubiquitous in training data). These distinctions testify to the credibility of the model, which has to matter to you if you're going to start moralizing.

rs186|3 months ago

The response would be more helpful if it directly addressed the arguments in the posts from that search result.

maxspero|3 months ago

There are dozens of first-generation AI detectors and they all suck. I'm not going to defend them. Most of them use perplexity-based methods; perplexity is a decent separator of AI and human text (80-90%) but has flaws that can't be overcome and a high false positive rate on ESL text.

https://www.pangram.com/blog/why-perplexity-and-burstiness-f...

Pangram is fundamentally different technology, it's a large deep learning based model that is trained on hundreds of millions of human and AI examples. Some people see a dozen failed attempts at a problem as proof that the problem is impossible, but I would like to remind you that basically every major and minor technology was preceded by failed attempts.
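For readers unfamiliar with the first-generation approach being criticized here: perplexity-based detection scores how "surprising" a text is to a reference language model and flags low-surprise text as AI. A minimal sketch, where the per-token log-probabilities and the threshold are made up for illustration (a real detector would get them from an actual scoring LM):

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean log-likelihood; lower = more predictable text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token natural-log probabilities from some scoring model.
human_like = [-4.1, -5.3, -2.8, -6.0, -3.5]  # varied, surprising word choices
ai_like    = [-1.2, -0.8, -1.5, -0.9, -1.1]  # uniformly predictable tokens

THRESHOLD = 10.0  # illustrative cutoff, not a calibrated value

def classify(token_logprobs, threshold=THRESHOLD):
    return "AI" if perplexity(token_logprobs) < threshold else "human"
```

The ESL flaw mentioned above follows directly from this design: non-native writing also tends toward predictable, low-perplexity phrasing, so a single threshold like this cannot separate it from AI output.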

ugh123|3 months ago

How do you discern between papers "completely fabricated" by AI vs. edited by AI for grammar?

ThrowawayTestr|3 months ago

Are you concerned with your product being used to improve AI to be less detectable?

Majromax|3 months ago

> Are you concerned with your product being used to improve AI to be less detectable?

The big AI providers don't have any obvious incentive to do this. If it happens 'naturally' in the pursuit of quality then sure, but explicitly training for stealth is a brand concern in the same way that offering a fully uncensored model would be.

Smaller providers might do this (again, in the same way they now offer uncensored models), but they occupy a minuscule fraction of the market and will be a generation or two behind the leaders.

maxspero|3 months ago

It's definitely going to be a back and forth - model providers like OpenAI want their LLMs to sound human-like. But this is the battle we signed up for, and we think we're more nimble and can iterate faster to stay one step ahead of the model providers.

jay_kyburz|3 months ago

I thought the author was attempting to highlight the hypocrisy of using an AI to detect other uses of AI, as if one was a good use, and the other bad.

interleave|3 months ago

Hi Max! Thank you for updating my mental model of AI detectors.

I was under the impression, with total certainty, that detecting AI-written text was an impossible-to-solve problem. I think that's because it's just so deceptively intuitive to believe that "for every detector, there'll just be a better LLM, and it'll never stop."

I had recently published a macOS app called Pudding to help humans prove they wrote a text mainly under the assumption that this problem can't be solved with measurable certainty and traditional methods.

Now I'm of course a bit sad that the problem can be solved much more directly (and hence my solution may not be needed). But hey, I fell in love with the problem, so I'm super impressed with what y'all are accomplishing at Pangram!

moffkalast|3 months ago

I see the bullshit part continues on the PR side as well, not just in the product.