I thought it would be obvious: OpenAI has used repos on GitHub as training data. It would be like testing someone with a publicly available past paper.

Are you planning on carrying out the experiment? Regardless of the outcome, it would be of value to developers.

simonw|13 days ago

Why wouldn't they train on Codeberg too?

It's pretty hard to block automated uses of "git clone".

netdevphoenix|12 days ago

Why would they? GitHub has 28 million public repos, Codeberg only hit 300k last year. Anyway, Codeberg was just a placeholder for 'repo source _less_ likely to be in their training data'. Codeberg was a quick candidate for a place to find a big old codebase with non-sensitive data.

It is indeed hard, but the guys at Codeberg are certainly an order of magnitude better than GitHub: they opted out of the main AI crawlers, they regularly block IPs known to belong to AI startups, and they let you make your repos accessible only to logged-in users.

You seem to be going off on a tangent here. The main point was about performing a well-documented test anyway.