(no title)
n_u | 1 month ago
Here the distribution is all queries relevant to your product made by someone who would be a potential customer. Short and directly relevant queries like "running shoes" will presumably appear more times than much longer queries. In short, you can't possibly hope to generate the entire distribution, so you sample a smaller portion of it.
But in LLM SEO it seems that assumption is not true. People will have much longer queries that they write out as full sentences: "I'm training for my first 5k, I have flat feet and tore my ACL four years ago. I mostly run on wet and snowy pavement, what shoe should I get?" which probably makes the number of queries you need to sample to get a large portion of the distribution (40% from above) much higher.
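To put rough numbers on that intuition, here is a minimal sketch. It assumes query popularity follows a Zipf-like power law, which is purely an illustrative assumption and not something measured from real search or LLM traffic. The function name `queries_needed` and the exponents used are hypothetical. A larger exponent stands in for head-heavy keyword search, a smaller one for long-tailed conversational prompts.

```python
from itertools import accumulate

def queries_needed(coverage, exponent, n_queries=100_000):
    """Count how many of the most popular distinct queries are needed to
    reach `coverage` of total query volume, assuming the query at rank r
    has weight r ** -exponent (a Zipf-like assumption, not real data)."""
    weights = [rank ** -exponent for rank in range(1, n_queries + 1)]
    total = sum(weights)
    for count, running in enumerate(accumulate(weights), start=1):
        if running / total >= coverage:
            return count
    return n_queries

# Head-heavy distribution (short keyword queries dominate).
head_heavy = queries_needed(0.4, exponent=1.5)
# Long-tailed distribution (many distinct full-sentence queries).
long_tail = queries_needed(0.4, exponent=0.8)
print(head_heavy, long_tail)
```

Under these assumptions, the long-tailed case needs far more distinct queries to reach the same 40% coverage, which is the gap described above.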
I would even guess it's the opposite: the number of short queries like "running shoes" fed into an LLM without any further back and forth is much lower than the number of longer full-sentence queries or even conversational ones. Additionally, because the context of the entire conversation is fed into the LLM, the query you need to sample might end up being even longer. For example:
user: "I'm hoping to exercise more to gain more cardiovascular fitness and improve the strength of my joints, what activities could I do?"
LLM: "You're absolutely right that exercise would help improve fitness. Here are some options with pros and cons..."
user: "Let's go with running. What equipment do I need to start running?"
LLM: "You're absolutely right to wonder about the equipment required. You'll need shoes and ..."
user: "What shoes should I buy?"
All of that is to say, this seems to make AI SEO much more difficult than regular SEO. Do you have any approaches to tackle that problem? Off the top of my head, I would try generating conversations and queries that could be relevant, then estimating their relevance with some embedding model plus heuristics about whether keywords or links to you or your competitors are mentioned. It's difficult to know how large a sample is required, though, without access to all conversations, which OpenAI etc. are unlikely to give you.
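A rough sketch of that scoring idea might look like the following. To stay self-contained it uses a bag-of-words cosine similarity as a stand-in for a real embedding model, and a simple substring check as the mention heuristic. The function names, the `mention_bonus` weight, and the brand name `ExampleBrand` are all hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; a crude stand-in for an embedding model's input."""
    return re.findall(r"[a-z0-9']+", text.lower())

def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def score_conversation(turns, product_description, brand_terms, mention_bonus=0.2):
    """Similarity of the whole conversation to the product description,
    plus a fixed bonus if any brand term appears anywhere in the turns."""
    text = " ".join(turns)
    similarity = cosine(text, product_description)
    mentioned = any(term.lower() in text.lower() for term in brand_terms)
    return similarity + (mention_bonus if mentioned else 0.0)

conversation = [
    "Let's go with running. What equipment do I need to start running?",
    "What shoes should I buy?",
]
print(score_conversation(conversation, "running shoes", ["ExampleBrand"]))
```

Scoring the concatenated turns rather than only the last message reflects the point that the whole conversation is the effective query. Swapping `cosine` for a real embedding similarity would be the obvious next step.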
JimsonYang|1 month ago
But you do bring a good perspective, because not all prompts are equal, especially with personalization. So how do we solve that problem? I'm not sure. I have yet to see anything in the industry. The only thing that came close was when a security-focused browser extension started selling data to AEO companies; that's how some companies get "prompt volume data".
n_u|1 month ago
I feel like without knowing the full distribution, it's really tough to know how many/what variations of the query/conversation you need to sample. This seems like something where OpenAI etc. could offer their own version of this to advertisers and have much better data because they know it all.
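One partial workaround, if you did have a sample of real prompts, is the Good-Turing estimate of "missing mass": the share of the distribution you have not seen yet is roughly the fraction of your sample made up of queries seen exactly once. The sketch below assumes the sample is drawn independently from the true distribution, which self-generated conversations are not, so treat it as an illustration only. The function name is hypothetical.

```python
from collections import Counter

def unseen_mass_estimate(sampled_queries):
    """Good-Turing estimate of the probability that the next query is one
    not present in the sample: (queries seen exactly once) / (sample size)."""
    counts = Counter(q.strip().lower() for q in sampled_queries)
    total = sum(counts.values())
    if total == 0:
        return 1.0  # nothing sampled yet, so everything is unseen
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total

sample = ["running shoes"] * 8 + [
    "what shoe should I get for flat feet on wet pavement?",
    "best shoes for a first 5k after an ACL tear",
]
print(unseen_mass_estimate(sample))
```

If most sampled queries are one-offs, the estimate stays close to 1, which is another way of saying the sample is still far too small.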
Interesting problem though! I always love probability in the real world. Best of luck. I played around with your product and it seems cool.