nic-waller | 5 years ago
Because I'm enthusiastic about data structures, and now that I've finished my work day, I thought I'd come back with a few numbers to support my earlier comment. The package size can be reduced by 90% by saving the names as compressed plaintext.
5192 KB gender.json
1484 KB gender.txt (71% savings)
488 KB gender.txt.gz (90% savings)
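As a rough illustration of what "compressed plaintext" could look like in practice, here is a minimal Python sketch that writes one name per line into a gzip file and scans it directly, without unpacking it to disk first. The sample names, the temporary path, and the `names_with_prefix` helper are made up for the example; the package's real loading code is not shown in this thread.

```python
import gzip
import os
import tempfile

# Hypothetical sample; the real list would come from gender.txt.
names = ["NIC", "PAUL", "NICOLA", "ANNA"]
path = os.path.join(tempfile.mkdtemp(), "gender.txt.gz")

# Write one name per line, compressed on the fly.
with gzip.open(path, "wt", encoding="ascii") as f:
    for name in sorted(names):
        f.write(name + "\n")


def names_with_prefix(path, prefixes):
    """Stream the gzip file line by line and keep names starting with any prefix."""
    with gzip.open(path, "rt", encoding="ascii") as f:
        return [line.rstrip("\n") for line in f if line.startswith(prefixes)]


found = names_with_prefix(path, ("NIC", "PAUL"))
print(found)
```

Streaming through `gzip.open` keeps memory use flat, at the cost of decompressing on every lookup.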
On my reasonably modern laptop, it takes over 700 ms to unmarshal 5 MB worth of JSON. But it takes less than 100 ms (85% time savings) to read the whole file and compare strings.
$ time jq -r . >/dev/null gender.json
real 0m0.764s
$ time grep -E '^(NIC|PAUL)' gender.txt
real 0m0.091s
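The same comparison can be approximated inside a single process. This is only a sketch: the data is synthetic, and it assumes the JSON is a flat object keyed by name, which is a guess based on the `keys[]` command rather than something confirmed here. The timings will differ from the jq and grep numbers above.

```python
import json
import time

# Synthetic stand-in data (assumption: a flat object keyed by name).
data = {f"NAME{i:06d}": "F" if i % 2 else "M" for i in range(200_000)}
json_text = json.dumps(data)
plain_text = "\n".join(data) + "\n"

# Option 1: parse the whole JSON document before looking anything up.
start = time.perf_counter()
parsed = json.loads(json_text)
json_seconds = time.perf_counter() - start

# Option 2: scan the plaintext and compare string prefixes, like the grep above.
prefixes = ("NAME00000", "NAME19999")
start = time.perf_counter()
matches = [line for line in plain_text.splitlines() if line.startswith(prefixes)]
scan_seconds = time.perf_counter() - start

print(f"json parse: {json_seconds * 1000:.1f} ms")
print(f"line scan:  {scan_seconds * 1000:.1f} ms, {len(matches)} matches")
```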
When working with large datasets of items that are mostly the same size, sometimes it's useful to use fixed-width records to enable random indexing into the file. Of course, this is a small data set so it's not really worthwhile to pursue such optimizations. So just for fun, here's an analysis of how long the names are (name length in characters, then number of names). 99% of names in this set are 13 characters or fewer. Representing short names as fixed-width records and long names as an appendix would use about 2.5 MB (50% savings compared to JSON).
2 222
3 1785
4 8913
5 27862
6 45986
7 44311
8 28291
9 14355
10 7167
11 4072
12 2611
13 1512
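To make the fixed-width idea concrete, here is a small Python sketch under a few assumptions: names are ASCII, padded with spaces to 13 bytes, and stored in sorted order so that a binary search can seek straight to any record. The function names and the tiny sample list are hypothetical, and the appendix for longer names is left out.

```python
import io

RECORD_WIDTH = 13  # assumption: pad every name to 13 bytes, per the histogram above


def pack_names(names):
    """Encode names as sorted, space-padded, fixed-width ASCII records."""
    out = bytearray()
    for name in sorted(names):
        raw = name.encode("ascii")
        if len(raw) > RECORD_WIDTH:
            raise ValueError(f"{name!r} is too long and belongs in the appendix")
        out += raw.ljust(RECORD_WIDTH, b" ")
    return bytes(out)


def read_record(f, index):
    """Fetch record number `index` with a single seek, no scanning."""
    f.seek(index * RECORD_WIDTH)
    return f.read(RECORD_WIDTH).rstrip(b" ").decode("ascii")


def find_name(f, count, target):
    """Binary search over the records; returns the record index or -1."""
    lo, hi = 0, count - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        name = read_record(f, mid)
        if name == target:
            return mid
        if name < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


names = ["NIC", "PAUL", "ALEXANDRIA", "BO"]  # hypothetical sample
table = io.BytesIO(pack_names(names))  # stands in for a file opened in "rb" mode
count = len(names)
print(read_record(table, 2))
print(find_name(table, count, "PAUL"))
```

Because every record starts at `index * RECORD_WIDTH`, a lookup touches only about log2(n) records instead of reading the whole file.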
PS. Here's how I prepared the text file:
$ <gender.json jq -r '. | keys[]' > gender.txt
parthmaul|5 years ago