Ah, I didn't see the N records originally. That makes sense!

Because I'm enthusiastic about data structures, and now that I've finished my work day, I thought I'd come back with a few numbers to support my earlier comment. The package size can be reduced by 90% by saving the names as compressed plaintext.

  5192 KB gender.json
  1484 KB gender.txt     (71% savings)
   488 KB gender.txt.gz  (90% savings)

On my reasonably modern laptop, it takes over 700 ms to unmarshal 5 MB worth of JSON. Scanning the whole plaintext file and comparing strings takes less than 100 ms (more than 85% time savings).

  $ time jq -r . >/dev/null gender.json
  real 0m0.764s
  
  $ time grep -E '^(NIC|PAUL)' gender.txt
  real 0m0.091s
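
For comparison, here is a minimal Python sketch of the same prefix scan run directly against the gzipped plaintext. It is only an illustration: `find_by_prefix` and the tiny sample file are hypothetical stand-ins for gender.txt.gz, and I have not timed it against the real data.

```python
import gzip
import os
import tempfile


def find_by_prefix(path, prefixes):
    """Return every line of a gzipped text file that starts with one of the prefixes."""
    wanted = tuple(prefixes)  # str.startswith accepts a tuple of alternatives
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh if line.startswith(wanted)]


# Tiny made-up stand-in for gender.txt.gz so the sketch runs on its own.
sample_names = ["NICOLE", "PAUL", "PAULA", "ZOE"]
sample_path = os.path.join(tempfile.mkdtemp(), "names.txt.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as fh:
    fh.write("\n".join(sample_names) + "\n")

# Same idea as grep -E '^(NIC|PAUL)': a linear scan with plain string comparison.
matches = find_by_prefix(sample_path, ["NIC", "PAUL"])
print(matches)
```

The point is that no parsing step is needed: each line is already a name, so a lookup is one pass over the decompressed stream.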

When working with large data sets of items that are mostly the same size, it's sometimes useful to store them as fixed-width records, which enables random indexing into the file. Of course, this is a small data set, so it's not really worthwhile to pursue such optimizations. So just for fun, here's an analysis of how long the names are. 99% of names in this set are 13 characters or fewer. Representing short names as fixed-width records and long names as an appendix would use about 2.5 MB (50% savings compared to JSON).
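
To make the fixed-width idea concrete, here is a rough Python sketch under some assumptions of mine: names are ASCII, records are padded with spaces to 13 bytes, the records are stored sorted so a lookup can binary-search by seeking, and names longer than 13 bytes go into a separate appendix list. The function names and sample names are made up for illustration.

```python
import io

WIDTH = 13  # record width in bytes (assumed cutoff: 99% of names fit)


def build(names):
    """Return (sorted fixed-width block, appendix of names too long for a record)."""
    encoded = [n.encode("ascii") for n in names]
    records = sorted(e.ljust(WIDTH, b" ") for e in encoded if len(e) <= WIDTH)
    appendix = sorted(n for n in names if len(n) > WIDTH)
    return b"".join(records), appendix


def record_at(fh, index):
    """Random access: record i always starts at byte offset i * WIDTH."""
    fh.seek(index * WIDTH)
    return fh.read(WIDTH)


def contains(fh, count, appendix, name):
    """Binary search over the fixed-width records; long names live in the appendix."""
    key = name.encode("ascii")
    if len(key) > WIDTH:
        return name in appendix
    key = key.ljust(WIDTH, b" ")
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        rec = record_at(fh, mid)
        if rec == key:
            return True
        if rec < key:
            lo = mid + 1
        else:
            hi = mid
    return False


# Made-up sample data; an in-memory buffer stands in for the file on disk.
block, appendix = build(["PAUL", "NICOLE", "ZOE", "PAULA", "MARIA-ANTONIETTA"])
fh = io.BytesIO(block)
count = len(block) // WIDTH
print(contains(fh, count, appendix, "PAUL"), contains(fh, count, appendix, "PAULX"))
```

Because every record has the same width, a lookup touches only about log2(count) records instead of reading the whole file, which is where this layout pays off on much larger data sets.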

PS. Here's how I prepared the text file:

  <gender.json jq -r 'keys[]' > gender.txt

parthmaul|5 years ago

I stumbled along and made the changes you recommended! Seems to be working fine. Thanks again for the tip!