nic-waller | 5 years ago

Storing the lookup map on disk as a JSON-encoded dictionary seems less than optimal for package size and module load time. Two plaintext files (M.txt and F.txt) would be simpler and more efficient on disk. The text is also highly compressible, which could further reduce package size. These things might matter if the package is used in a serverless environment.
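To make the two-file idea concrete, here is a minimal Python sketch. The file names M.txt and F.txt come from the suggestion above; everything else is an assumption for illustration: one uppercase name per line, optional gzip compression, the helper names `load_names` and `build_lookup`, and the choice to report a name found in both files as "N".

```python
import gzip
from pathlib import Path


def load_names(path):
    """Read one name per line from a plain or gzip-compressed text file."""
    path = Path(path)
    # gzip.open in "rt" mode decompresses and decodes in one step.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        # Assumption: names are compared case-insensitively, stored uppercase.
        return {line.strip().upper() for line in fh if line.strip()}


def build_lookup(male_path, female_path):
    """Return a function mapping a name to "M", "F", "N" or None."""
    male = load_names(male_path)
    female = load_names(female_path)

    def lookup(name):
        key = name.strip().upper()
        in_m, in_f = key in male, key in female
        if in_m and in_f:
            return "N"  # assumption: a name listed in both files is ambiguous
        if in_m:
            return "M"
        if in_f:
            return "F"
        return None  # unknown name

    return lookup
```

Loading into sets keeps each lookup constant-time after the one-off read, and the same code works whether or not the files are shipped compressed.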

Also, do you think there could be value in identifying classically androgynous names?

parthmaul | 5 years ago

Thanks for sharing your feedback! Great idea on using .txt instead - I'll make a change for that. (This is my first time sharing a package I've prepared on GitHub, so I'm a noob with that kind of stuff.)

There are names in the current JSON file classified as "N", which stands for non-binary, but they are quite rare. A name is classified as "N" if its "M" and "F" frequencies are equal, or if the two frequencies are within a certain magnitude of each other (the magnitude calculation is based on proportions testing). With that being said, maybe it'd be worth adding functionality for a user to upload their own gender_lookup file?
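As a rough illustration of that kind of rule, here is a Python sketch. The exact test and threshold used by the package are not stated here, so this is only one plausible reading: a one-sample z-test of the male share against 0.5. The function name `classify` and the default `z_threshold` of 1.96 are hypothetical.

```python
import math


def classify(male_count, female_count, z_threshold=1.96):
    """Label a name "M", "F" or "N" from its per-gender counts.

    Hypothetical rule: test the male share against 0.5 and return "N"
    when the two counts cannot be told apart at the given threshold.
    """
    total = male_count + female_count
    if total == 0:
        raise ValueError("name has no observations")
    share = male_count / total
    # Standard error of a proportion under the null hypothesis p = 0.5.
    std_err = math.sqrt(0.25 / total)
    z = (share - 0.5) / std_err
    if abs(z) < z_threshold:
        return "N"  # equal counts give z == 0 and always land here
    return "M" if z > 0 else "F"
```

With this rule a 55/45 split over 100 observations stays "N", while the same 60/40 ratio over 1000 observations is labelled "M", because the larger sample makes the difference significant.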

nic-waller | 5 years ago

Ah, I didn't see the N records originally. That makes sense!

Because I'm enthusiastic about data structures, and now that I've finished my work day, I thought I'd come back with a few numbers to support my earlier comment. The package size can be reduced by 90% by saving the names as compressed plaintext.

  5192 KB gender.json
  1484 KB gender.txt     (71% savings)
   488 KB gender.txt.gz  (90% savings)

On my reasonably modern laptop, it takes over 700 ms to unmarshal 5 MB worth of JSON. But it takes less than 100 ms (at least 85% time savings) to read the whole file and compare strings.

  $ time jq -r . >/dev/null gender.json
  real 0m0.764s
  
  $ time grep -E '^(NIC|PAUL)' gender.txt
  real 0m0.091s

When working with large datasets of items that are mostly the same size, it's sometimes useful to use fixed-width records to enable random indexing into the file. Of course, this is a small dataset, so it's not really worthwhile to pursue such optimizations. So just for fun, here's an analysis of how long the names are. 99% of names in this set are 13 characters or fewer. Representing short names as fixed-width records and long names as an appendix would use about 2.5 MB (50% savings compared to JSON).

  length   count
       2     222
       3    1785
       4    8913
       5   27862
       6   45986
       7   44311
       8   28291
       9   14355
      10    7167
      11    4072
      12    2611
      13    1512

PS. Here's how I prepared the text file:

  <gender.json jq -r '. | keys[]' > gender.txt
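For anyone curious about the fixed-width idea mentioned earlier, here is a rough Python sketch of how random indexing could work. The 13-character width comes from the length analysis above; the rest is assumed for illustration: ASCII-only names, a single trailing label byte per record, space padding, and the helper names `write_records` and `find`.

```python
import os

NAME_WIDTH = 13               # from the length analysis: most names fit in 13
RECORD_SIZE = NAME_WIDTH + 1  # assumption: one trailing byte holds the label


def write_records(path, entries):
    """Write sorted fixed-width records; return the names that were too long."""
    appendix = {}
    short = []
    for name, label in entries.items():
        encoded = name.encode("ascii")  # assumption: names are plain ASCII
        if len(encoded) > NAME_WIDTH:
            appendix[name] = label  # long names are kept outside the main file
        else:
            short.append(encoded.ljust(NAME_WIDTH) + label.encode("ascii"))
    with open(path, "wb") as fh:
        # Names are unique, so sorting whole records sorts by padded name.
        fh.write(b"".join(sorted(short)))
    return appendix


def find(path, name):
    """Binary-search the record file, reading one record per probe."""
    key = name.encode("ascii").ljust(NAME_WIDTH)
    if len(key) > NAME_WIDTH:
        return None  # too long for a record; the caller checks the appendix
    with open(path, "rb") as fh:
        lo, hi = 0, os.path.getsize(path) // RECORD_SIZE
        while lo < hi:
            mid = (lo + hi) // 2
            fh.seek(mid * RECORD_SIZE)  # record offset is index * RECORD_SIZE
            record = fh.read(RECORD_SIZE)
            if record[:NAME_WIDTH] < key:
                lo = mid + 1
            elif record[:NAME_WIDTH] > key:
                hi = mid
            else:
                return record[NAME_WIDTH:].decode("ascii")
    return None
```

Because every record has the same size, a lookup touches only about log2(n) records instead of reading the whole file. The histogram above adds up to 187,087 names of 13 characters or fewer; at 14 bytes each, that is roughly 2.6 million bytes, which is in line with the 2.5 MB estimate.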