I think, from my experience, what they mean is that tool use is only as good as your model's ability to stick to a given answer template/grammar. For example, if it does tool calling via a JSON format, it needs to stick to that format: not hallucinate extra fields, and use the existing fields properly. This has worked for a few years and LLMs keep getting better, but the more tools you have, and the more parameters your callable functions take, the higher the risk of errors. There are also systems that constrain the inference itself, for example the outlines package, by changing the way tokens are sampled (this lets you force a model to stick to a template/grammar, though it can also degrade results in other ways).
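To make the failure modes concrete, here's a minimal sketch of the kind of guard you end up writing around a model's tool calls. The tool name, field names, and call format below are made up for illustration, not any particular vendor's API:

```python
import json

# Hypothetical tool spec: tool name -> required/optional parameter names.
TOOLS = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
}

def validate_tool_call(raw: str):
    """Return (ok, reason). Rejects malformed JSON, unknown tools,
    hallucinated extra fields, and missing required parameters."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    name = call.get("name")
    if name not in TOOLS:
        return False, f"unknown tool: {name!r}"
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"
    spec = TOOLS[name]
    # The classic failure: the model invents fields the tool never declared.
    extra = set(args) - (spec["required"] | spec["optional"])
    if extra:
        return False, f"hallucinated fields: {sorted(extra)}"
    missing = spec["required"] - set(args)
    if missing:
        return False, f"missing required fields: {sorted(missing)}"
    return True, "ok"
```

Constrained-decoding approaches (like outlines) try to make this check unnecessary by never sampling tokens that violate the grammar in the first place; the trade-off mentioned above is that restricting sampling can hurt output quality in other ways.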
libraryofbabel|4 months ago