top | item 46701787

n_u | 1 month ago

> wrap a small number of third-party ChatGPT/Perplexity/Google AIO/etc scraping APIs

Can you explain a little bit how this works? I'm guessing the third-parties query ChatGPT etc. with queries related to your product and report how often your product appears? How do they produce a distribution of queries that is close to the distribution of real user queries?

JimsonYang|1 month ago

1) How third parties query your product: For ChatGPT specifically, they open a headless browser, ask a question, and capture the results, such as the response and any citations. From there, they extract entities from the response. During onboarding I’m asked who my competitors are, and the brands in each response are recognized against that list. For example, if the query is “what are the best running shoes” and the response is something like “Nike is good, Adidas is okay, and On is expensive,” and my company is On, entity recognition against my list of competitors is used to see which ones appear in the response and in which order.

If this weren’t automated, the process would look like this: someone manually reviews each response, pulls out the companies mentioned and their order, and then presents that information.
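To make the extraction step concrete, here is a minimal sketch of it, assuming plain whole-word string matching stands in for whatever entity recognition the third-party tools actually use. The function name `ranked_mentions` and the extra brand "Hoka" are made up for illustration; the response text is the running-shoe example above.

```python
import re


def ranked_mentions(response: str, brands: list[str]) -> list[str]:
    """Return the tracked brands found in a response, ordered by first appearance."""
    first_seen = {}
    for brand in brands:
        # Case-sensitive whole-word match, so the brand "On" is not confused
        # with the preposition "on". A sentence that starts with "On ..." would
        # still be a false positive; real entity recognition needs more context.
        match = re.search(rf"\b{re.escape(brand)}\b", response)
        if match:
            first_seen[brand] = match.start()
    return sorted(first_seen, key=first_seen.get)


response = "Nike is good, Adidas is okay, and On is expensive."
brands = ["On", "Nike", "Adidas", "Hoka"]  # my company plus the competitor list
print(ranked_mentions(response, brands))  # ['Nike', 'Adidas', 'On']
```

Brands that never appear (here "Hoka") simply drop out of the ranking, which is itself a useful signal.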

2) Distribution of queries: This is a bit of a dirty secret in the industry (intentional or not). Ideally you take snapshots and measure them over time to get a distribution. However, a lot of tools will run a query once across different AI systems, take the results, and call it done.

Obviously, that isn’t very representative. If you search “best running shoes,” there are many possible answers, and different companies behave differently. What better tools like Profound do is run the same prompt multiple times. By my estimate, Profound runs each prompt up to 8 times. This gives a broader snapshot of what tends to show up every day. You then aggregate those snapshots over time to approximate a distribution.

As a side note: you might argue that running a prompt 8 times isn’t statistically significant, and that’s partially true. However, LLMs tend to regress toward the mean and surface common answers over repeated runs, and we found 8 runs to be a good indicator. The level of completeness depends on the prompt (e.g. "what should I have for dinner" vs. "what are good accounting software for startups"). I can touch on that more if you want.
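A rough sketch of the aggregation, assuming each run has already been reduced to an ordered list of brands. The 8 runs below are invented data, and the Wilson score interval is my own addition to show how wide the uncertainty still is at 8 runs; I don't know what any particular tool computes.

```python
import math
from collections import defaultdict


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion hits/n."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def aggregate_runs(runs: list[list[str]]) -> dict[str, dict]:
    """Per brand: share of runs it appears in, mean position, and an interval."""
    if not runs:
        return {}
    positions = defaultdict(list)
    for ranked in runs:
        for pos, brand in enumerate(ranked, start=1):
            positions[brand].append(pos)
    return {
        brand: {
            "appearance_rate": len(p) / len(runs),
            "mean_position": sum(p) / len(p),
            "interval": wilson_interval(len(p), len(runs)),
        }
        for brand, p in positions.items()
    }


# Hypothetical results of running "best running shoes" 8 times.
runs = [
    ["Nike", "Adidas", "On"], ["Nike", "On"], ["Adidas", "Nike"],
    ["Nike", "Adidas", "On"], ["Nike", "Hoka"], ["On", "Nike"],
    ["Nike", "Adidas"], ["Adidas", "Nike", "On"],
]
stats = aggregate_runs(runs)
for brand, s in sorted(stats.items(), key=lambda kv: -kv[1]["appearance_rate"]):
    low, high = s["interval"]
    print(f"{brand}: rate={s['appearance_rate']:.2f} "
          f"mean_pos={s['mean_position']:.1f} interval=({low:.2f}, {high:.2f})")
```

In this made-up data On shows up in 5 of 8 runs, and the interval around that rate is wide, which is the "partially true" part of the significance objection.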

n_u|1 month ago

As I understand it, in normal SEO the number of unique queries that could be relevant to your product is quite large, but you might focus on a small subset of them ("running shoes", "best running shoes", "running shoes for 5k", etc.) because you assume that those top queries capture a significant portion of the distribution (e.g. perhaps those 3 queries capture >40% of all queries related to running shoe purchases).

Here the distribution is all queries relevant to your product made by someone who would be a potential customer. Short and directly relevant queries like "running shoes" will presumably appear more times than much longer queries. In short, you can't possibly hope to generate the entire distribution, so you sample a smaller portion of it.

But in LLM SEO it seems that assumption is not true. People will have much longer queries that they write out as full sentences: "I'm training for my first 5k, I have flat feet and tore my ACL four years ago. I mostly run on wet and snowy pavement, what shoe should I get?" which probably makes the number of queries you need to sample to get a large portion of the distribution (40% from above) much higher.

I would even guess it's the opposite, and the number of short queries like "running shoes" fed into an LLM without any further back and forth is much lower than longer full-sentence queries or even conversational ones. Additionally, because the context of the entire conversation is fed into the LLM, the query you need to sample might end up being even longer. For example:

user: "I'm hoping to exercise more to gain more cardiovascular fitness and improve the strength of my joints, what activities could I do?"

LLM: "You're absolutely right that exercise would help improve fitness. Here are some options with pros and cons..."

user: "Let's go with running. What equipment do I need to start running?"

LLM: "You're absolutely right to wonder about the equipment required. You'll need shoes and ..."

user: "What shoes should I buy?"

All of that is to say, this seems to make AI SEO much more difficult than regular SEO. Do you have any approaches to tackle that problem? Off the top of my head, I would try generating conversations and queries that could be relevant and estimating their relevance with some embedding model and heuristics about whether keywords or links to you/competitors are mentioned. It's difficult to know how large a sample is required, though, without having access to all conversations, which OpenAI etc. are unlikely to give you.
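Roughly what I have in mind, as a toy sketch: score each generated query by cosine similarity to a product description, then add a bonus when a tracked keyword shows up. The bag-of-words `embed` below is only a stand-in for a real embedding model, and the product description, keyword set, and `keyword_bonus` value are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    # Placeholder: word counts instead of a learned embedding vector.
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def relevance(query: str, description: str, keywords: set[str],
              keyword_bonus: float = 0.2) -> float:
    """Similarity to the product description plus a flat bonus for keyword hits."""
    score = cosine(embed(query), embed(description))
    if set(tokenize(query)) & {k.lower() for k in keywords}:
        score += keyword_bonus
    return score


description = "running shoes for road and trail runners"  # hypothetical
keywords = {"shoe", "shoes", "running"}                    # hypothetical
candidates = [
    "I'm training for my first 5k, I have flat feet and tore my ACL four "
    "years ago. I mostly run on wet and snowy pavement, what shoe should I get?",
    "What shoes should I buy?",
    "what should i have for dinner",
]
for query in sorted(candidates,
                    key=lambda q: relevance(q, description, keywords),
                    reverse=True):
    print(f"{relevance(query, description, keywords):.3f}  {query[:60]}")
```

With word counts, shared filler words like "for" and "and" leak into the similarity, so a real embedding model (or at least stopword removal) would be needed before trusting the scores. It also doesn't answer the sample-size question; it only helps filter what to sample.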