item 37192238

I Have Blocked OpenAI

18 points | 2 years ago | gavinhoward.com

55 comments


SOLAR_FIELDS|2 years ago

This is such a technopurist take. People who use LLMs already know they can give wrong information. Your documentation won’t be able to cover every single contextual scenario that an LLM can help with. I think there are valid reasons not to let OpenAI spider you, but this one is just silly and feels pretty egotistical. People aren’t going to this guy saying, “Well, OpenAI said your software works this way and it doesn’t.” It’s an entirely contrived scenario that doesn’t exist in reality.

EatingWithForks|2 years ago

> People who use LLM’s already know they can give wrong information

I think this is unfortunately much less true than expected... Lawyers using ChatGPT, teachers using ChatGPT, even professors using ChatGPT... as if it's a source of truth.

gavinhoward|2 years ago

They're only not doing that because my software is not common yet. But look at the GitHub issues for any semi-famous project and you'll see a lot of questions based on misunderstandings, and that's before LLMs poisoned everything.

superkuh|2 years ago

For the last two weeks my little webserver has been getting 200+ hits a day from bots with the user agent anthropic-ai. At first it was what you'd expect: mirroring all the PDFs and such. But for the last week it's just /robots.txt, 200+ times per day, from amazon-ec2, so I have no way of knowing if it's actually anthropic-ai.

I was happy that they'd be including documents on topics I found interesting and things I wrote in the word adjacency training of their foundational model. That'd mean the model would be more useful to me. But the robots.txt stuff is weird. Maybe it's because I've had,

    User-agent: Killer AI
    Disallow: /~superkuh/

in there for the last 10 years? /s
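
For what it's worth, a non-joke version of that entry is possible: both companies advertise user-agent tokens they claim to honor (GPTBot is OpenAI's documented token; anthropic-ai matches the logs above — treat both as assumptions, since robots.txt is purely advisory):

```
User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /
```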

sigilis|2 years ago

You should take down the documentation entirely if you want to prevent incorrect interpretations of things. The LLMs won’t be the ones emailing you, and the people who would get things wrong if an LLM provided a confidently wrong answer would probably simply not read your documentation, as the vast majority of users do not. You’re just shifting some, but not all, misunderstandings into totally uninformed questions that will mean an additional email pointing them to RTFM.

All of these “we’re not letting bots crawl our site!” posts make me feel like I’ve travelled back in time to when having web spiders crawl your site was a big deal. You can’t really prevent people from using tools wrong, and it is odd that so many people care about this futile attempt to insulate themselves from stupid users that I managed to see it on the front page of HN.

The worst part is, if an LLM has already read your docs and the interaction you fear comes to pass, users will have misapprehensions based on the old version of your docs, which will be even more wrong.

Allow me to prepare you for the future now, before you have to hear it from someone else: you will be getting email spam about LLM Algorithm Optimization soon. LLMAO firms are probably already organizing around the bend of time; we’re just a little before they become visible.

gavinhoward|2 years ago

It's easier to put one link into an email than to try to explain things to people.

ljoshua|2 years ago

I agree that LLMs are more likely than not to answer documentation questions wrong, to hallucinate methods that don’t exist, or to just be silly. But the value I see in allowing LLMs to train on documentation is in the glue code that an LLM could (potentially!) generate.

Documentation, even good docs, usually only answers the question “What does this method/class/general idea do?” Really good docs will come with some examples of connecting A and B. But they will often not include examples of connecting A to E when you have to transform via P because of business requirements, and they almost never tell you how to incorporate third-party libraries X, Y, and Z.

As an engineer, I can read the docs and figure out the bits, but having an LLM suggest some of the intermediary or glue steps, even if wrong sometimes, is a benefit I don’t get only from good documentation.

Racing0461|2 years ago

Unpopular opinion: LLM responses being wrong are still valuable to me, since they give me a better jumping-off point for exploring than nothing at all, especially with something like coding, where errors feed back quickly because something doesn't compile or doesn't work as intended. Could be harmful in other areas, though.

gremlinsinc|2 years ago

Yeah, if the LLM gives me two truths that are beyond the documentation, like an edge case or an example described in a way that is easier for me to grok, and one false thing, usually the false thing is so bad that I can tell it's false, or it's merely truthy, and the value from the two truths exceeds the negative value of the falsehood.

Generally speaking, though, you can also cut back on hallucination by asking a second LLM for a source, or by adding system messages instructing the model to say so when it doesn't know an answer rather than make one up.

Really, I think hallucination is the wrong word; bullshitting or gaslighting might be better. You're asking it something, and it thinks you want an answer, any answer, so if it doesn't know, it makes one up. Similar to people who confess to crimes they didn't commit because of distressing interrogation tactics.

gavinhoward|2 years ago

Author here.

My docs will include tutorial links at the top, and those tutorials will focus on accomplishing common tasks.

I believe that's a good jumping off point.

input_sh|2 years ago

> Despite the volume of documentation, my documentation would still be just a tiny blip in the amount of information in the LLM, and it will still pull in information from elsewhere to answer questions.

I sympathise. I've recently discovered that apparently I have enough Internet clout that ChatGPT knows about me, as in I can carefully construct a prompt and it will unmistakably reference me in particular. I don't even need to provide my name in the prompt.

Except, every fucking detail of what it "knows" about me is 100% false, and there's nothing I can do to correct it. I'm from a wrong country, I did things in my career that I absolutely didn't, etc.

Needless to say, I also blocked its crawler.

speedgoose|2 years ago

I understand that some people don’t want their work to train AI. Personally I like that the work I publish is not completely useless as it is at least used to train LLMs.

RecycledEle|2 years ago

We are all myopic in our own ways.

The guy who posted about blocking OpenAI, so that it will not answer questions about his software wrong (meaning not completely), ignores that his documentation is inaccessible to many less technically literate people. LLM AIs help bridge the gap, getting newbies using software before they can understand the manuals.

NoZebra120vClip|2 years ago

When I entered college, my first Pascal course was on an SVR3 Unix system, and I read every manpage that I could find, because it was fantastic that I had access to that. Previously, I had read every shred of documentation for the Commodore 65xx systems, which generously included every technical detail possible. I mean, I had basically started on this in fifth grade. Reading manuals is how I gained my technical literacy.

gavinhoward|2 years ago

How do you know my docs are inaccessible to less technically literate people?

kristianp|2 years ago

IIRC, LLMs also use Common Crawl data for training. Are they also blocking Common Crawl?

Another thing: ChatGPT 4 can do live retrieval of websites in response to users' questions. I imagine a different crawler does that. Are they going to block that too?
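
If the goal is covering those cases too, Common Crawl and OpenAI's live-retrieval feature each advertise their own tokens (CCBot and ChatGPT-User, per their public docs — treat the exact strings as assumptions), so blocking GPTBot alone would not be enough. Roughly:

```
User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
```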

twelve40|2 years ago

This. Unfortunately, there is Common Crawl, there is Bing, and there are a million other ways they could get the data. Or they could just ignore robots.txt; it's not like it's a very honest or transparent operation they run there.

b800h|2 years ago

I bet information about his software is available elsewhere, and now ChatGPT will make up even more. I don't know how this gets fixed. Structured, queryable data, I guess.

gavinhoward|2 years ago

Author here.

You are correct, but if I demonstrate that I have done what I could to deny OpenAI access, and they still have it in their model, then I probably have more legal recourse against them.

Cantinflas|2 years ago

> But here’s the problem: it will answer them wrong.

There is no way to know that, and even if it ends up being true, blocking OpenAI will likely make the problem worse; e.g., the AI's answers will be worse without access to the documentation.

gavinhoward|2 years ago

But if people come to me with problems, I can give them a link to that post and say, "GPTx does not know my stuff. You will want to read the docs yourself."

pleoxy|2 years ago

Adding friction to the use of information about your product seems like a disservice to the users/customers.

Not having that information in the system at all will only degrade the answers, not change who is asking.

dutchbrit|2 years ago

Just a thought I have: wouldn't it be better to block all robots and only whitelist a select few? More AI bots are scraping now, and more will in the future…
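
A default-deny robots.txt along those lines is straightforward to sketch; the whitelisted names here are just examples, and whether a crawler honors per-agent groups (the most specific matching group applies) is only as reliable as the crawler itself:

```
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
```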

gavinhoward|2 years ago

Author here.

I wish I could, but I bet most would just ignore robots.txt.
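
Since robots.txt is advisory, crawlers that ignore it have to be refused at the server instead. A minimal nginx sketch, matching on the User-Agent header (the tokens are examples, and a determined crawler can spoof the header entirely):

```
# In the http context: flag requests whose User-Agent matches known AI crawlers.
map $http_user_agent $deny_bot {
    default        0;
    ~*GPTBot       1;
    ~*CCBot        1;
    ~*anthropic-ai 1;
}

server {
    listen      80;
    server_name example.com;

    # Refuse flagged crawlers before serving anything.
    if ($deny_bot) {
        return 403;
    }
}
```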

zmnd|2 years ago

Can you give an example of a question someone asked GPT-4 and was misled by? And how was that question better answered by one of your tutorials?

worrycue|2 years ago

I wonder if that would even help. If an LLM knows nothing at all about the software, it might just make up complete bullshit anyway.

orbit7|2 years ago

Does OpenAI also scan the Wayback Machine? If it does, and you are on it, you may wish to remove yourself from there as well.