top | item 45957667

(no title)

liotier | 3 months ago

"Brad Edwards" and "Bradley Edwards" might be the same individual.

tovej|3 months ago

Yes, the dataset also has three entries for Virginia Giuffre: "Virginia L. Giuffre", "Virginia Roberts Giuffre", and "Jane Doe Number 3 (Virginia Roberts)".

adolph|3 months ago

I read a recent observation that people subject to discovery often make deliberate typos in key names so that their communications stay under the radar.

potato3732842|3 months ago

Everyone is potentially subject to discovery. Some people are just more aware of it.

GuinansEyebrows|3 months ago

Likewise for instances of "Larry" and "Lawrence" Summers... probably a lot of those.

DrewADesign|3 months ago

I’m sure some developer/archivist is working on a name authority as we speak.

cyrusradfar|3 months ago

Great use case for AI: suggesting mergers and cleaning up the data.

specproc|3 months ago

LLMs are awful for this. I've got a project that's doing structured extraction and half the work is deduplication.

I didn't go down the LLM route for the cleanup, because you run into scale and context issues with larger datasets.

I got into semantic similarity networks for this use case. You can do efficient pairwise matching with Annoy, set a cutoff threshold, and drop every edge below it; the connected subgraphs that remain are your merger candidates.

I wrapped up my code in a little library if you're into this sort of thing.

github.com/specialprocedures/semnet
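
The threshold-and-subgraph idea described above can be illustrated with a small standard-library sketch. This is not the semnet API, and it swaps in two simplifications that should be read as assumptions: `difflib.SequenceMatcher` stands in for embedding similarity, and a brute-force loop over all pairs stands in for Annoy's approximate nearest-neighbour search. The function names and the 0.7 cutoff are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Character-level score in [0, 1].

    Assumption: a stand-in for the cosine similarity of name embeddings.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_candidates(names, threshold=0.7):
    """Link names scoring at or above `threshold`, then return the
    connected groups that contain more than one name."""
    parent = list(range(len(names)))  # union-find forest over name indices

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force pairwise comparison: O(n^2), fine for a demo only.
    # An approximate nearest-neighbour index would replace this loop.
    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    names = [
        "Brad Edwards",
        "Bradley Edwards",
        "Virginia L. Giuffre",
        "Virginia Roberts Giuffre",
        "Jane Doe Number 3 (Virginia Roberts)",
        "Larry Summers",
        "Lawrence Summers",
    ]
    for group in merge_candidates(names):
        print(group)
```

With these inputs the sketch groups the two Edwards entries, the two Summers entries, and the two Giuffre spellings, but leaves "Jane Doe Number 3 (Virginia Roberts)" on its own, since a character-level score cannot tell that an alias refers to the same person. The groups are candidates for review rather than confirmed merges, and the cutoff would need tuning per dataset.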