I've got thousands of XML documents in a format whose XSD is not published. I'd like to produce an XSD for it, and I am wondering if an LLM could help. I've tried a few online LLMs like Claude and Copilot, and the best they (or I) could do was to generate an XSD from a handful of XML files. While the XSD was more or less valid, it was far from capturing all cases of the underlying format, and it failed on the very next XML document I tried.

I am ready to run a local LLM for this task, but can someone with more LLM experience than me (I have none) describe a good process for doing so? And which LLM might be suited?
Thanks!
ksr|1 year ago
codingdave|1 year ago
ksr|1 year ago
seabass-labrax|1 year ago
Having used the results so far to annotate the original elements and attributes with their types, you could then pass a generated, simplified XML document into the LLM. So where the original document has real data, you can start replacing it with simple data that conforms to the same structure and data type. If the LLM is still confused, try giving it just the structure which you've identified with no actual data within the elements and attributes, only type annotations.
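To make the "structure only, no actual data" idea concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function names `guess_type` and `skeletonize` and the small pattern table are my own assumptions for illustration, not part of any tool; the type guesses are deliberately crude and would need extending for a real format.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, deliberately small table of type guesses; extend as needed.
_PATTERNS = [
    ("xs:integer", re.compile(r"[+-]?\d+")),
    ("xs:decimal", re.compile(r"[+-]?\d+\.\d+")),
    ("xs:date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("xs:boolean", re.compile(r"true|false")),
]


def guess_type(value: str) -> str:
    """Return the first type name whose pattern matches the whole value."""
    value = value.strip()
    for name, pattern in _PATTERNS:
        if pattern.fullmatch(value):
            return name
    return "xs:string"


def skeletonize(xml_text: str) -> str:
    """Replace attribute values and text content with type annotations."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Copy the items first so the mapping is not mutated mid-iteration.
        for key, value in list(elem.attrib.items()):
            elem.set(key, guess_type(value))
        if elem.text and elem.text.strip():
            elem.text = guess_type(elem.text)
        # Text following a child element (mixed content) lives in .tail.
        if elem.tail and elem.tail.strip():
            elem.tail = guess_type(elem.tail)
    return ET.tostring(root, encoding="unicode")


print(skeletonize('<order id="42"><date>2024-01-05</date><note>hello</note></order>'))
# <order id="xs:integer"><date>xs:date</date><note>xs:string</note></order>
```

The output is much shorter than the real document and carries no private data, so it is cheaper to feed to a model. One caveat: `ElementTree` rewrites namespace prefixes on output, so namespaced formats may need extra handling.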
TL;DR: a depth-first approach and then building up from there will work better than giving everything to an LLM all at once. They are only clever thematic Markov chains after all.
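One possible reading of "depth-first, then build up" is to walk every document in the corpus and record, per element path, which attributes and child elements ever appear. The sketch below assumes the documents fit in memory as strings; `inventory` is a hypothetical helper name, and the result is only raw material for writing the XSD (by hand or with an LLM), not an XSD itself.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def inventory(xml_texts):
    """Map each element path to the attribute and child names seen anywhere."""
    seen = defaultdict(lambda: {"attrs": set(), "children": set()})

    def visit(elem, path):
        # Depth-first walk: record this element, then descend into children.
        entry = seen[path]
        entry["attrs"].update(elem.attrib)
        for child in elem:
            entry["children"].add(child.tag)
            visit(child, f"{path}/{child.tag}")

    for text in xml_texts:
        root = ET.fromstring(text)
        visit(root, f"/{root.tag}")
    return dict(seen)


docs = ['<a x="1"><b/></a>', '<a y="2"><b z="3"/><c/></a>']
for path, info in sorted(inventory(docs).items()):
    print(path, sorted(info["attrs"]), sorted(info["children"]))
```

Because the inventory is merged across all documents, rare elements and attributes that a handful of sample files would miss still show up, and each path can then be handed to the model on its own, leaves first.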