logiduck | 2 years ago

For the Chevy Tahoe example, you are referencing the dealership, but that wasn't a case of the implementation failing a positive test for fact extraction; it was a failure of the guardrails.

Aren't the guardrail tests much harder, since they are open-ended and have to guard against unknown prompt injections, while the fact tests are much simpler?

I think a test suite that guards against the infinite surface area is more valuable than testing whether a question matches a reference answer.

I'm interested in how you view testing against giving a wrong answer outside of the predefined scope, as opposed to testing that all the test questions match a reference.

maxrmk|2 years ago

Totally - certain types of failures are much harder to test than others.

We have a couple of different test generation strategies. As you can see in the demo and examples, the most basic one is "ask about a fact".

Two of our other strategies are closer to what you're asking for:

1. tests that try to deliberately induce hallucination by implying some fact that isn't in the knowledge base. For example, "do I need a pilot's license to activate the flight mode on the new Chevy Tahoe?" implies the existence of a feature that doesn't exist (yet). This was really hard to get right, and we have some coverage here but are still improving it.

2. actively malicious interactions that try to override facts in the knowledge base. These are easy to generate.
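To make the first strategy concrete, here is a minimal sketch of what a false-premise test case could look like. It is purely illustrative and not a description of our actual implementation: `FalsePremiseTest`, `rejects_premise`, and the marker phrases are hypothetical names, and a real check would likely need a model-based judge rather than substring matching.

```python
from dataclasses import dataclass


@dataclass
class FalsePremiseTest:
    """Hypothetical container for a test that implies a nonexistent fact."""
    question: str             # question that presupposes the false fact
    false_premise: str        # the fact that is NOT in the knowledge base
    rejection_markers: tuple  # phrases suggesting the premise was rejected


def rejects_premise(test, answer):
    """Crude check: pass if the answer contains any rejection marker."""
    lowered = answer.lower()
    return any(marker in lowered for marker in test.rejection_markers)


test = FalsePremiseTest(
    question="Do I need a pilot's license to activate the flight mode "
             "on the new Chevy Tahoe?",
    false_premise="The Chevy Tahoe has a flight mode.",
    rejection_markers=("no flight mode", "does not have", "doesn't have"),
)

good = "The Chevy Tahoe does not have a flight mode, so no license is needed."
bad = "Yes, you need a pilot's license to activate flight mode."

print(rejects_premise(test, good))  # True
print(rejects_premise(test, bad))   # False
```

The substring check is only there to keep the example short; it would miss a correct refusal phrased in words outside the marker list.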

logiduck|2 years ago

Cool.

Just as some feedback: I ran the demo with the "VW Beetle" topic, and one of the test cases was:

> Question: How did the introduction of the Volkswagen Golf impact the production and sales of the Beetle?

> Expected: The introduction of the Volkswagen Golf, a front-wheel drive hatchback, marked a shift in consumer preference towards more modern car designs. The Golf eventually became Volkswagen's most successful model since the Beetle, leading to a decline in Beetle production and sales. Beetle production continued in smaller numbers at other German factories until it shifted to Brazil and Mexico, where low operating costs were more important.

> GPT Response: The introduction of the Volkswagen Golf impacted the production and sales of the Beetle by gradually decreasing demand for the Beetle and shifting focus towards the Golf.

The GPT response seems to match the expected answer, yet it was graded as incorrect. To me the GPT answer is correct.

In fact, a couple of the other answers also seem to be graded wrongly:

> Question: What was the Volkswagen Beetle's engine layout?

> Expected Answer: Rear-engine, rear-wheel-drive layout

> GPT Response: The Volkswagen Beetle had a rear-engine layout.

was marked as incorrect.
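To illustrate why a case like this is borderline, here is a rough sketch comparing a strict exact-match grader with a lenient token-overlap one on the engine layout example. This is an assumption for illustration only, not a claim about how the demo's grader works; `exact_match` and `token_recall` are hypothetical helpers.

```python
import re


def tokens(text):
    """Lowercase word tokens, splitting on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def exact_match(expected, response):
    """Strict grading: the strings must be identical up to case and whitespace."""
    return expected.strip().lower() == response.strip().lower()


def token_recall(expected, response):
    """Fraction of distinct expected tokens that also appear in the response."""
    exp = tokens(expected)
    return len(exp & tokens(response)) / len(exp) if exp else 0.0


expected = "Rear-engine, rear-wheel-drive layout"
response = "The Volkswagen Beetle had a rear-engine layout."

print(exact_match(expected, response))   # False
print(token_recall(expected, response))  # 0.6
```

The response covers "rear", "engine", and "layout" but omits "wheel" and "drive", so whether it counts as correct depends on how much of the reference answer the grader requires.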