(no title)
parthmaul | 5 years ago
The current JSON file does contain names classified as "N", which stands for non-binary, but they're quite rare. A name gets "N" if its "M" and "F" frequencies are equal, or if the two frequencies are within a certain margin of each other (the margin comes from a proportions test). With that being said, maybe it'd be worth adding functionality for a user to upload their own gender_lookup file?
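To make the "N" rule a bit more concrete, here's a rough sketch of how a proportions test could drive the label. This is only an illustration: the function name `classify`, the 1.96 cutoff, and the choice of a one-sample z-test against 0.5 are all assumptions for the example, not necessarily what the package does.

```python
import math

def classify(male_count, female_count, z_crit=1.96):
    """Label a name 'M', 'F', or 'N' from its counts (hypothetical helper).

    Assumption: 'N' means the male share is not significantly
    different from 0.5 under a one-sample z-test for a proportion.
    """
    total = male_count + female_count
    if total == 0:
        return "N"  # no data, so no evidence either way
    p_male = male_count / total
    # Standard error of a proportion under H0: p = 0.5
    std_err = math.sqrt(0.5 * 0.5 / total)
    z = (p_male - 0.5) / std_err
    if abs(z) < z_crit:
        return "N"  # frequencies are equal or too close to call
    return "M" if z > 0 else "F"
```

With a rule like this, a name with few observations tends to land in "N" unless the split is very lopsided, while a name with many observations needs only a modest imbalance to get "M" or "F".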
nic-waller|5 years ago
Because I'm enthusiastic about data structures, and now that I've finished my work day, I thought I'd come back with a few numbers to support my earlier comment. The package size can be reduced by 90% by saving the names as compressed plaintext.
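For anyone who wants to try the compressed plaintext idea, here's a minimal sketch using Python's standard `gzip` module. The file layout (one `name<TAB>label` pair per line) and the helper names `save_names` and `find_label` are assumptions for illustration, not the format I actually measured.

```python
import gzip

def save_names(path, lookup):
    """Write a {name: label} dict as gzip-compressed plaintext.

    Assumed layout: one 'name<TAB>label' record per line, sorted by name.
    """
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for name, label in sorted(lookup.items()):
            f.write(f"{name}\t{label}\n")

def find_label(path, name):
    """Scan the compressed file and compare strings; None if not found."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            key, _, label = line.rstrip("\n").partition("\t")
            if key == name:
                return label
    return None
```

The lookup is a plain linear scan, so it never builds the whole table in memory the way parsing one big JSON document does.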
On my reasonably modern laptop, it takes over 700 ms to unmarshal 5 MB worth of JSON. But it takes less than 100 ms (85% time savings) to read the whole file and compare strings.
When working with large datasets of items that are mostly the same size, sometimes it's useful to use fixed-width records to enable random indexing into the file. Of course, this is a small data set so it's not really worthwhile to pursue such optimizations.
So just for fun, here's an analysis of how long the names are. 99% of names in this set are 13 characters or less. Representing short names as fixed-width records and long names as an appendix would use about 2.5 MB (50% savings compared to JSON).
PS. Here's how I prepared the text file:
parthmaul|5 years ago