btn's comments

btn | 4 years ago | on: Reverse-Engineering Apple Dictionary (2020)

Another approach for this is to explore the format through Apple's tools for building dictionaries – as they provide a "Dictionary Development Kit" in Xcode's downloadable "Additional Tools" package (which has documentation for the XML format and a bunch of scripts/binaries for building the bundle).

I wound up doing this a while ago for a similar toy project. After some poking around, it turned out that dictionary bundles are entirely supported by system APIs in CoreServices! The APIs are private, but Apple accidentally shipped a header file with documentation for them in the 10.7 SDK [1]. You can load a dictionary with `IDXCreateIndexObject()`, read through its indices with the search methods (and the convenient `kIDXSearchAllMatch`), and get pointers to its entry data with `IDXGetFieldDataPtrs()`.

It takes a bit of fiddling to figure out the structure (there are multiple indices for headwords, search keywords, cross-references, etc., and the API is a general-purpose trie library) and request the right fields, but those property lists in the bundle are there to help! (As the author of this article discovered, the entries are compressed and are preceded by a 4-byte length marker.)
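For illustration, that framing is simple to parse once you have the data pointers. A minimal Python sketch of the layout described above---the little-endian length and per-entry zlib framing are assumptions here, and real bundles may frame the compressed data differently---round-tripped on synthetic data:

```python
import struct
import zlib

def pack_entry(xml: str) -> bytes:
    """Frame one entry: 4-byte little-endian length, then compressed bytes."""
    compressed = zlib.compress(xml.encode("utf-8"))
    return struct.pack("<I", len(compressed)) + compressed

def parse_entries(data: bytes) -> list[str]:
    """Walk a buffer of length-prefixed, compressed entries."""
    entries, offset = [], 0
    while offset + 4 <= len(data):
        (length,) = struct.unpack_from("<I", data, offset)
        offset += 4
        entries.append(zlib.decompress(data[offset:offset + length]).decode("utf-8"))
        offset += length
    return entries

# Synthetic stand-in for the bytes you'd get back from IDXGetFieldDataPtrs().
blob = pack_entry('<d:entry id="a">apple</d:entry>') + pack_entry('<d:entry id="b">pear</d:entry>')
print(parse_entries(blob))
```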

[1] https://github.com/phracker/MacOSX-SDKs/blob/master/MacOSX10...

btn | 7 years ago | on: Loss aversion is not supported by the evidence

This article is essentially a press release for the author's own paper: https://onlinelibrary.wiley.com/doi/abs/10.1002/jcpy.1047

Which itself is a part of a series of articles in JCP debating the issue: https://onlinelibrary.wiley.com/doi/abs/10.1002/jcpy.1054

The definitive statement made by this article's headline isn't really supported by the evidence presented in the papers. Rather, the state of affairs seems to be that "loss aversion" has been the victim of incessant overgeneralisation. It's a very simple hypothesis about human behaviour that plays nicely into a lot of interesting (and therefore publishable) narratives. This has led people to blindly accept the general hypothesis of loss aversion without enough critical investigation of its manifestations. The authors don't really refute "loss aversion" (i.e. they don't present an alternative theory to explain the papers that purport to demonstrate it), but rather they refute the pop-psychology belief that it's a general principle of human behaviour.

btn | 9 years ago | on: How Not to Explain Success

"Social science" isn't a singular amorphous blob, and these methods aren't uniformly accepted.

Online surveys are certainly becoming more popular, as they are significantly cheaper to conduct than the alternatives and yield publishable results that garner media attention. There are peer reviewers who will be sympathetic to these methods, regardless of their robustness.

However, there are others that would say this reeks of dredging (p-hacking) in a very murky pool of data. Their "scepticism" rarely makes the New York Times (or a bestselling book), though.

btn | 10 years ago | on: The Kolmogorov-Smirnov Test

This is a very nice review, but in practice I've found the K-S test to be much less useful than it initially appears:

1. Failing to reject the null hypothesis is not the same as accepting the null hypothesis. That is, concluding "these data are from some distribution X" is spurious.

2. There's a 'sweet-spot' for the amount of data. If you have too few samples, it's very easy to fail to reject; and if you have too many, it's very easy to reject (the chart at the bottom of the "Two Sample Test" section illustrates this).

3. The question "are these data from some distribution X?" is usually too strong. It's usually more informative to ask "can these data be modelled with some distribution X?"
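Point 2 is easy to see numerically. Here's a rough pure-Python sketch (the 1.36 constant is the usual large-sample approximation of the critical value at alpha ~ 0.05): with a small but real 0.2-sigma shift between the populations, the statistic will typically fail to clear the generous small-sample threshold, while easily clearing the much tighter large-sample one.

```python
import random

def ks_statistic(xs, ys):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:   # step past ties in either sample
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def critical_d(n, m, c=1.36):
    """Approximate rejection threshold at alpha ~= 0.05 (large-sample formula)."""
    return c * ((n + m) / (n * m)) ** 0.5

random.seed(0)
shift = 0.2  # a small, real difference between the two populations
for n in (20, 2000):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(shift, 1) for _ in range(n)]
    print(n, round(ks_statistic(a, b), 3), round(critical_d(n, n), 3))
```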

btn | 11 years ago | on: Compressing Scrabble Dictionaries

The node packing format he describes sounds a bit like a LOUDS tree [1], which stores the structure of a tree as a bit array (each node contributes a '1' for each child, plus a terminating '0'---for a total of 2n-1 bits for a tree of n nodes), and the data in a separate packed array. It can't represent node deduplication (nodes with multiple parents), but I think it gives comparable compression: for the full word list of 3,213,156 nodes, the tree structure is 6,426,311 bits (0.76MB), plus 3,213,156 bytes of character data---for 3.83MB total.

The downside is that traversing the tree is a series of linear bit-counting operations---which can be painfully slow without a bit of pre-caching.
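To make the layout concrete, here's a toy sketch in Python (node ids are level-order positions; the select/rank helpers are deliberately naive linear scans, which is exactly the bit-counting cost mentioned above):

```python
from collections import deque

def louds_encode(children, labels):
    """Encode a tree in level order: a '1' per child plus a terminating '0'
    for each node, with node data packed separately in the same order."""
    bits, data = [], []
    queue = deque([0])                # node 0 is the root
    while queue:
        node = queue.popleft()
        data.append(labels[node])
        kids = children.get(node, [])
        bits.extend([1] * len(kids) + [0])
        queue.extend(kids)
    return bits, data

def select0(bits, k):
    """Index of the k-th '0' (1-indexed) -- a linear scan."""
    seen = 0
    for i, b in enumerate(bits):
        seen += (b == 0)
        if seen == k:
            return i
    raise IndexError(k)

def child_ids(bits, i):
    """Level-order ids of node i's children: node i's run of '1's starts
    just past the i-th '0', and the k-th '1' overall belongs to node k."""
    pos = 0 if i == 0 else select0(bits, i) + 1
    rank = sum(bits[:pos])            # another linear bit count
    kids = []
    while bits[pos] == 1:
        rank += 1
        kids.append(rank)
        pos += 1
    return kids

# A 4-node chain for "ca"/"cat": root -> 'c' -> 'a' -> 't'; 2n-1 = 7 bits.
bits, data = louds_encode({0: [1], 1: [2], 2: [3]}, {0: "", 1: "c", 2: "a", 3: "t"})
print(bits, data, child_ids(bits, 1))
```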

[1]: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/aladdin/wwwloca...

btn | 12 years ago | on: Improving GitHub for science

Is that spelt out somewhere? The only mention of it I can find is in the confirmation email they sent: "will be free for the next two years".

btn | 12 years ago | on: Improving GitHub for science

Something they don't mention until after you've signed up: the micro plan only lasts for two years. I assume any private repositories will become locked if you don't pay for a subscription after that (as with regular accounts).

In comparison with BitBucket (not to advocate, but they offer a comparable service): the restrictions they waive for academic accounts are waived permanently.

btn | 12 years ago | on: JavaScript has a Unicode problem (2013)

> are you sure that deleting the whole grapheme is actually what the Tamil or Korean user wants?

I'm not, but I think it's the only sane thing for a text editor to do if you don't want it to incorporate a ton of language-specific rules. The UAX actually does make a distinction between "legacy" and "extended" grapheme clusters---if you're handling "delete", you'll want "legacy clusters", which separate the two Tamil marks; but for text selection, "extended clusters" will combine them (it's a little more complicated than that, but there are properties of Unicode that let you handle the "preferred" method for editing a script, while remaining mostly language-agnostic).

Hangul is trickier, but input happens through an IME that "composes" the characters before they are committed to the editor. The IME will perform component-wise deletion, but once it's committed, the editor will operate on the grapheme. It's not a perfect solution, but keeping the composition/decomposition rules for the language in the IME seems preferable.

btn | 12 years ago | on: JavaScript has a Unicode problem (2013)

Counting graphemes may be over-used, but needing to know their boundaries is important (and leads naturally to counting). For example, when you hit "delete" in a text editor, you'll probably want it to delete whole graphemes (and similarly for text selection); if you're doing text truncation, you may measure it by pixels, but you'll want to chop off the excess bytes at a grapheme boundary.

> in the unlikely case I had to support Tamil or Korean for such a specialistic case.

Why is it "unlikely" that you would want your software to support users of other languages?

btn | 12 years ago | on: JavaScript has a Unicode problem (2013)

One of the major difficulties with Unicode handling is not just that there are poor implementations out there with legacy baggage, but a lot of poor advice as well (or well-meaning advice that seems correct, but misses some corner case or some language). For example, this article wants to count "graphemes", and the author goes through three versions of an algorithm to account for surrogate pairs and various combining marks. All seems well in the test cases the author shows, but combining marks are only one class of codepoints that can join to form a grapheme, and the algorithm will fail for other grapheme clusters such as 'நி' (Tamil Letter NA + Tamil Vowel Sign I), Hangul made of conjoining Jamo (such as '깍': 'ᄁ' + 'ᅡ' + 'ᆨ'), or clusters joined by control characters.
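For concreteness, here's what those two examples look like at the code-point level in plain Python---no grapheme logic involved, just the raw sequences (NFC happens to compose the Jamo case into a single code point, but no such escape hatch exists for the Tamil one):

```python
import unicodedata

tamil = "\u0BA8\u0BBF"        # TAMIL LETTER NA + TAMIL VOWEL SIGN I -> 'நி'
jamo = "\u1101\u1161\u11A8"   # conjoining Jamo -> '깍'

print(len(tamil))                               # 2 code points, 1 user-perceived character
print(len(jamo))                                # 3 code points, 1 user-perceived character
print(len(unicodedata.normalize("NFC", jamo)))  # 1: composes into the precomposed syllable
print(len(unicodedata.normalize("NFC", tamil))) # 2: no precomposed form exists
```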

Luckily, the Unicode Technical Committee has figured this out for you, and UAX#29 provides an algorithm for determining grapheme cluster boundaries [1]. Yes, it's long and technical, it has many cases (and exceptions) to handle, and it can't be expressed compactly in two lines of JavaScript; but it will give you a well-defined and understood answer for all scripts in Unicode.

[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Bounda...

btn | 12 years ago | on: How I “hacked” Kayak and booked a cheaper flight

No, your IP has no effect; by default, you're seeing prices as if they were booked from the departure city.

Matrix lets you specify the "sales city" (the last field in the advanced search options), which allows you to check out price discrimination by location.

btn | 12 years ago | on: Microsoft's First Chip Brings Tank-Finding Design to Xbox

The original Kinect did not use time-of-flight technology, but projected a structured infrared light pattern and observed the displacement of the pattern to determine depth information. The Kinect that will ship with the Xbox One will use time-of-flight sensing (probably from the ZCam assets they bought with 3DV Systems).