top | item 47049917

onetimeusename|12 days ago

I think that's a fear I have about AI for programming (and I use these tools). Let's say we have a generation of people who use AI tools to code and no one really thinks hard about solving problems in niche spaces. We can build commercial products quickly and easily, but no one writes code for difficult problem spaces, so no one builds up expertise in important subdomains for a generation. Then what will AI be trained on in, say, 20-30 years? Old code? Its own AI-generated code from vibe-coded projects? How will AI be able to do new things well if it was trained on what people wrote previously and no one writes novel code themselves? It seems to me that AI is pretty dependent on having a corpus of human-made code, so, for example, I am not sure it will be able to learn how to write highly optimized code for some ISA in the future.

wreath|12 days ago

> Then what will AI be trained on in let's say 20-30 years? Old code? It's own AI developed code for vibe coded projects?

I've seen variations of this question since the first few weeks and months after the release of ChatGPT, and I haven't seen an answer from leading figures in the AI coding space. What's the general answer or point of view on this?

righthand|12 days ago

The general answer is what they’re already doing: ignoring the facts and riding the wave.

lowbloodsugar|12 days ago

Is it hard to imagine that things will just stay the same for 20-30 years or longer? Here is an example in the B programming language from 1969, over 50 years ago:

  printn(n, b) {
      extrn putchar;
      auto a;

      if (a = n / b)      /* assignment, not test for equality */
          printn(a, b);   /* recursive */
      putchar(n % b + '0');
  }
You'd think we'd have a much better way of expressing the details of software 50 years later. But here we are, still using ASCII text separated by curly braces.
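To underline the point: the 1969 routine translates almost line for line into a language in wide use today. A sketch in Python (the braces are gone, but the recursion and the digit arithmetic are expressed exactly the same way, as ASCII text):

```python
def printn(n, b):
    """Print n in base b, one digit at a time: the same recursion as the B code."""
    a = n // b
    if a:                    # nonzero quotient: emit the higher digits first
        printn(a, b)
    print(chr(n % b + ord('0')), end="")
```

Half a century of language design, and the shape of the program is unchanged.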

sockaddr|12 days ago

I suspect a more general and much more clever learning algorithm will emerge by then, one that needs less training data to reach a competent problem-solving state, even with dirty data: something able to discriminate between novel information and junk. Until then, I think there will be a quality decline after a few more years.

Eisenstein|11 days ago

I don't think that more code makes LLMs better at writing code past a certain point.

fastasucan|12 days ago

>So let's say we have a generation of people who use AI tools to code and no one really thinks hard about solving problems in niche spaces.

I don't think we need to wait a generation either. This was probably part of their personality already, but a group of developers at my job seems to have just given up on thinking hard and thinking through difficult problems; it's insane to witness.

atomic128|12 days ago

Exactly. Prose, code, visual arts, etc.: AI material drowns out human material. AI tools disincentivize understanding, skill development, and novelty (anything "outside the training distribution"). Intellectual property is no longer protected: what you publish becomes de facto anonymous common property.

Long-term, this will do enormous damage to society and our species.

The solution is that you declare war and attack the enemy with a stream of slop training data ("poison"). You inject vast quantities of high-quality poison (inexpensive to generate but expensive to detect) into the intakes of the enemy engine.

LLMs are highly susceptible to poisoning attacks. This is their "Achilles' heel". See: https://www.anthropic.com/research/small-samples-poison

We create poisoned git repos on every hosting platform. Every day we feed two gigabytes of poison to web crawlers via dozens of proxy sites. Our goal is a terabyte per day by the end of this year. We fill the corners of social media with poison snippets.

There is strong, widespread support for this hostile posture toward AI. For example, see: https://www.reddit.com/r/hacking/comments/1r55wvg/poison_fou...

Join us. The war has begun.

nz|10 days ago

Originally posted this comment here (https://news.ycombinator.com/item?id=47073581), but relevant to this subthread too.

The lesson that I am taking away from AI companies (and their billionaire investors and founders) is that property theft is perfectly fine. Which is a _goofy_ position to hold if you are a billionaire, or even a millionaire. Like, if property theft is perfectly acceptable, and if they own most of the property (intellectual or otherwise), then there can only be _upside_ for less fortunate people like us.

The implicit motto of this class of hyper-wealthy people is: "it's not yours if you cannot keep it". Well, game on.

(There are 56.5e6 millionaires, and 3e3 billionaires -- making them 0.7% of the global population. They are outnumbered 141.6 to 1. And they seem to reside and physically congregate in a handful of places around the world. They probably wouldn't even notice that their property is being stolen, and even if they did, a simple cycle of theft and recovery would probably drive them into debt).
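The headcount arithmetic in that parenthetical checks out, assuming a world population of roughly 8.06 billion (the figure the stated 141.6:1 ratio implies; that number is my back-fill, not from the comment):

```python
millionaires = 56.5e6
billionaires = 3e3
wealthy = millionaires + billionaires
world_pop = 8.06e9          # assumption: implied by the 141.6-to-1 ratio above

share = wealthy / world_pop                  # fraction of the population
ratio = (world_pop - wealthy) / wealthy      # everyone else per wealthy person

print(f"{share:.1%}")   # 0.7%
print(f"{ratio:.1f}")   # 141.6
```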

tormeh|12 days ago

This will happen regardless. LLMs are already ingesting their own output. At the point where AI output becomes the majority of internet content, interesting things will happen. Presumably the AI companies will put lots of effort into finding good training data, and ironically that will probably be easier for code than anything else, since there are compilers and linters to lean on.
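The "compilers and linters to lean on" idea can be sketched concretely. A hypothetical, deliberately crude filter (my illustration, not any AI company's actual pipeline), assuming the candidate snippets are Python and using only the standard library's `ast` parser as the gate:

```python
import ast

def passes_syntax_gate(snippet: str) -> bool:
    """Keep a candidate training snippet only if the compiler front end accepts it.

    A real pipeline would also compile, run tests, and lint;
    parsing is just the cheapest first gate.
    """
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

# Malformed snippets are rejected before they reach the training set.
corpus = ["def f():\n    return 1\n", "def f( return"]
clean = [s for s in corpus if passes_syntax_gate(s)]
```

Nothing comparable exists for prose, which is the asymmetry the comment is pointing at.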

plagiarist|12 days ago

I was wondering if anyone was doing this after reading about LLMs scraping every single commit on git repos.

Nice. I hope you are generating realistic commits and they truly cannot distinguish poison from food.

plagiarist|12 days ago

AI will be trained on the code it wrote, our feedback on that code, and the final clean architecture(?) working(?) result after that feedback.

8note|12 days ago

I don't think we'll see that, because the niche spaces belong to people from other disciplines who do some code but don't consider coding important.

They've already thought about the problem before reaching for code as a solution.