parthmaul | 5 years ago

Thanks for sharing your feedback! Great idea on using .txt instead; I'll make that change. (This is my first time sharing a package I've prepared on GitHub, so I'm a noob with that kind of stuff.)

There are names in the current JSON file classified as "N", which stands for non-binary, but their frequency is quite low. A name is classified as "N" if the frequency of "M" equals the frequency of "F", or if the two frequencies are within a certain magnitude of each other (the magnitude calculation is based on proportions testing). With that being said, maybe it'd be worth adding functionality for a user to upload their own gender_lookup file?
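To make the "N" rule concrete, here is a rough sketch of one way a proportions test could drive the classification. It is not the package's actual code: the function name `classify`, the one-sample z-test against 0.5, and the 1.96 cutoff are all assumptions chosen for illustration.

```python
import math


def classify(m_count: int, f_count: int, z_crit: float = 1.96) -> str:
    """Return "M", "F", or "N" from how often a name was recorded per gender.

    Hypothetical rule: test whether the male share differs from 0.5.
    If the difference is not significant, call the name "N".
    """
    n = m_count + f_count
    if n == 0:
        raise ValueError("name has no observations")
    p_m = m_count / n
    # Standard error of a proportion under the null hypothesis p = 0.5.
    se = math.sqrt(0.25 / n)
    z = (p_m - 0.5) / se
    if abs(z) < z_crit:
        return "N"  # includes the exact M == F case, where z is 0
    return "M" if z > 0 else "F"
```

With this rule, equal counts always give "N", and a near-even split only stops being "N" once there are enough observations for the gap to be significant.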

nic-waller|5 years ago

Ah, I didn't see the N records originally. That makes sense!

Because I'm enthusiastic about data structures, and now that I've finished my work day, I thought I'd come back with a few numbers to support my earlier comment. The package size can be reduced by 90% by saving the names as compressed plaintext.

  5192 KB gender.json
  1484 KB gender.txt     (71% savings)
   488 KB gender.txt.gz  (90% savings)

On my reasonably modern laptop, it takes over 700 ms to unmarshal 5 MB worth of JSON. But it takes less than 100 ms (85% time savings) to read the whole text file and compare strings.

  $ time jq -r . >/dev/null gender.json
  real 0m0.764s
  
  $ time grep -E '^(NIC|PAUL)' gender.txt
  real 0m0.091s
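For anyone who would rather do the same scan from Python than from grep, here is a minimal sketch that reads the compressed file directly. It assumes gender.txt.gz holds one name per line in UTF-8; the function name `names_with_prefix` is made up for this example.

```python
import gzip


def names_with_prefix(path, prefixes):
    """Yield each line of a gzip-compressed text file that starts with any prefix."""
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of candidates
    # "rt" decompresses and decodes on the fly, so the file stays small on disk.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            name = line.rstrip("\n")
            if name.startswith(prefixes):
                yield name
```

Calling `list(names_with_prefix("gender.txt.gz", ["NIC", "PAUL"]))` is roughly the Python counterpart of the grep command above. I haven't timed it, so the numbers above only apply to the shell version.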

When working with large datasets of items that are mostly the same size, sometimes it's useful to use fixed-width records to enable random indexing into the file. Of course, this is a small data set, so it's not really worthwhile to pursue such optimizations. So just for fun, here's an analysis of how long the names are. 99% of names in this set are 13 characters or fewer. Representing short names as fixed-width records and long names as an appendix would use about 2.5 MB (50% savings compared to JSON).
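Here is a rough sketch of the fixed-width idea, in case it helps picture it. Everything in it is an assumption for illustration: the 13-byte width comes from the 99% figure, records are space-padded and sorted, and the long names are simply returned to the caller instead of being written to an appendix.

```python
import os

RECORD_WIDTH = 13  # assumed width, taken from the 99% figure


def write_fixed_width(names, path, width=RECORD_WIDTH):
    """Write names that fit in `width` bytes as sorted, space-padded records.

    Returns the names that did not fit, for the caller to store separately.
    """
    records, overflow = [], []
    for name in names:
        encoded = name.encode("utf-8")
        if len(encoded) <= width:
            records.append(encoded.ljust(width, b" "))
        else:
            overflow.append(name)
    with open(path, "wb") as fh:
        for record in sorted(records):
            fh.write(record)
    return overflow


def contains(path, name, width=RECORD_WIDTH):
    """Binary-search the record file, seeking straight to record boundaries."""
    target = name.encode("utf-8").ljust(width, b" ")
    if len(target) != width:
        return False  # too long to be in the fixed-width part
    with open(path, "rb") as fh:
        lo, hi = 0, os.path.getsize(path) // width
        while lo < hi:
            mid = (lo + hi) // 2
            fh.seek(mid * width)  # record i always starts at byte i * width
            record = fh.read(width)
            if record == target:
                return True
            if record < target:
                lo = mid + 1
            else:
                hi = mid
    return False
```

The point is that record i always starts at byte i * width, so a lookup touches only a handful of records instead of reading the whole file.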

PS. Here's how I prepared the text file:

  <gender.json jq -r '. | keys[]' > gender.txt
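If jq isn't available, the same export can be sketched in Python. This assumes, as the jq filter implies, that gender.json is a single object keyed by name; the function name `export_names` is hypothetical, and writing gzip when the output path ends in .gz is just a convenience I added.

```python
import gzip
import json


def export_names(json_path, out_path):
    """Write the top-level keys of a JSON object to a text file, one per line.

    Keys are sorted, like the output of jq's `keys`. If out_path ends in
    ".gz", the output is gzip-compressed. Returns the number of names written.
    """
    with open(json_path, encoding="utf-8") as fh:
        names = sorted(json.load(fh))  # iterating a dict yields its keys
    opener = gzip.open if out_path.endswith(".gz") else open
    with opener(out_path, "wt", encoding="utf-8") as out:
        for name in names:
            out.write(name + "\n")
    return len(names)
```

For example, `export_names("gender.json", "gender.txt.gz")` would produce the compressed file in one step.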

parthmaul|5 years ago

I stumbled along and made the changes you recommended! Seems to be working fine. Thanks again for the tip!