top | item 16595525


tinymollusk | 8 years ago

Something I've been wondering about for a while now... Would it be possible to train an algorithm to identify similar characteristics of data from different schemas? Looking at the actual data, I mean, and inferring a translation table or the like?

I have a background in data engineering and don't really know where I'd get started. But if you could figure out a way to throw differently schema'd data at an algorithm and have it try to create a universal schema, you'd be wealthy.

There would be a ton of challenges, but the problem you described seems like a societally valuable one to solve.
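One way to see what "inferring a translation table from the data itself" might look like: below is a minimal sketch (all names and the scoring weights are hypothetical, not an established algorithm) that proposes column mappings between two schemas by comparing value overlap and the character "shape" of values, rather than column names.

```python
import re

def value_signature(values):
    """Crude per-column signature: normalized value set plus a character-shape set.

    Shapes map digits to '9' and letters to 'a', so '555-1234' becomes '999-9999'.
    """
    normalized = {str(v).strip().lower() for v in values if v is not None}
    shapes = {re.sub(r"[a-z]", "a", re.sub(r"\d", "9", s)) for s in normalized}
    return normalized, shapes

def jaccard(a, b):
    """Set overlap: |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    return len(a & b) / len(a | b) if a | b else 0.0

def match_columns(table_a, table_b, threshold=0.25):
    """Propose a translation table between two schemas by inspecting only
    the data: tables are dicts of column name -> list of values.
    The 0.7/0.3 weighting is an arbitrary illustrative choice.
    """
    mapping = {}
    for col_a, vals_a in table_a.items():
        norm_a, shapes_a = value_signature(vals_a)
        best, best_score = None, threshold
        for col_b, vals_b in table_b.items():
            norm_b, shapes_b = value_signature(vals_b)
            # Exact value overlap is weighted above mere shape similarity.
            score = 0.7 * jaccard(norm_a, norm_b) + 0.3 * jaccard(shapes_a, shapes_b)
            if score > best_score:
                best, best_score = col_b, score
        if best is not None:
            mapping[col_a] = best
    return mapping
```

For example, a column of names would map to another column of names via direct value overlap, while two disjoint sets of phone numbers would still match on shape alone. This illustrates both the idea and its limits: the moment the semantics live in application code rather than in the values, a data-only matcher like this has nothing to go on.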

yourapostasy | 8 years ago

That's pretty much a description of a chunk of process re-engineering gigs, and even with the data, metadata, domain experts, meetings, and documentation, the results are a struggle to obtain, so any solution will need to set some appropriate expectations. You might find some traction looking into automated ontology building, which has some promising tangents. But I think expecting any black box approach to yield high-fidelity results will end in tears, so there has to be more than just the schema'd data.

For some kinds of data like account numbers, phone numbers, names, and addresses, it is possible to envision some kind of algo. But that's not where the interesting action happens. After working with lots of application developers over the years, I've learned they cram the damnedest bits of information into databases, and it's dangerous to make assumptions about how they use the data (whether temporally, structurally, computationally, etc.). Sometimes with absolutely no rhyme or reason to what or why, either; the rationale and logic are embedded entirely in front of the database, within the application code.
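The "easy" cases mentioned above (phone numbers, account numbers, and the like) are exactly the ones where a pattern-based labeler works. A rough sketch, with a deliberately tiny and hypothetical pattern catalogue; real data would need locale-aware rules and many more patterns:

```python
import re

# Hypothetical catalogue of semantic types; US-centric and illustrative only.
PATTERNS = {
    "us_phone": re.compile(r"^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$"),
    "zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def infer_semantic_type(values, min_match=0.8):
    """Label a column by whichever pattern matches at least min_match
    of its non-null values; return None when nothing clears the bar."""
    vals = [str(v).strip() for v in values if v is not None]
    if not vals:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in vals if pattern.match(v))
        if hits / len(vals) >= min_match:
            return label
    return None
```

This is the part that is mechanical. The point of the paragraph above stands: a free-text or overloaded column whose meaning lives in application logic will come back as None, and no amount of pattern matching recovers it.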

Without access to that code, even process re-engineering teams are flummoxed at just understanding a schema, not to speak of porting it. So I find it challenging to envision a scenario where inspecting only the data can yield better results; by the time we're restricting ourselves to only the data, and only at a snapshot in time, we've lost too many bits of information, and the resolution of the solution is too coarse-grained to be universally useful. As we add more inspection time, more fine-grained snapshots, and more kinds of data, however, resolution goes up; the real question becomes just how much is needed to generate an MVP.