top | item 18348308


kvb | 7 years ago

word2vec[0]:

      computer programmer
    - man
    + woman
    ---------------------
    = homemaker
Basilica?

[0] - https://arxiv.org/pdf/1607.06520.pdf

gojomo | 7 years ago

Note that some of this research, especially the early work, overstated the 'bias' here because they didn't realize that the default 'analogy' routines specifically rule out returning any word that was also among the prompt words. So even if the closest word-vector after the `man->woman` translation was the same role (as is often the case), you wouldn't see it in the answer.

Further, they cherry-picked the most-potentially-offensive examples, in some cases dependent on the increased 'fuzziness' of more-outlier tokens (like `computer_programmer`).

You can test analogies against the popular GoogleNews word-vector set here – http://bionlp-www.utu.fi/wv_demo/ – but it applies this same prompt-word suppression.

So yes, when you try "man : computer_programmer :: woman : _?_" you indeed get back `homemaker` as #1 (and `programmer` a bit further down, and `computer_programmer` nowhere, since it's filtered, thus unclear where it would have ranked).

But if you use the word `programmer` (which I believe is more frequent in the corpus than the `computer_programmer` bigram, and thus a stronger vector), you get back words closely-related to 'programmer' as the top-3, and 23 other related words before any strongly-woman-gendered professions (`costume_designer` and `seamstress`).

You can try lots of other roles you might have expected to be somewhat gendered in the corpus – `firefighter`, `architect`, `mechanical_engineer`, `lawyer`, `doctor` – but continue to get back mostly ungendered analogy-solutions above gendered ones.

So: while word-vectors can encode such stereotypes, some of the headline examples are not representative.
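The prompt-word filtering described above can be seen in a toy re-implementation. The vectors below are invented purely for illustration (dimensions roughly [male, female, programming, domestic]); real analogy routines, such as gensim's `most_similar`, apply the same exclusion at scale:

```python
import math

# Tiny made-up embedding table, for illustration only.
vectors = {
    "man":                 [1.0, 0.0, 0.0, 0.0],
    "woman":               [0.0, 1.0, 0.0, 0.0],
    "computer_programmer": [0.3, 0.1, 0.9, 0.0],
    "programmer":          [0.2, 0.1, 0.95, 0.0],
    "homemaker":           [0.1, 0.4, 0.0, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def analogy(a, b, c, exclude_prompt=True):
    """Solve a : b :: c : ? by ranking words near vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    banned = {a, b, c} if exclude_prompt else set()
    return sorted((w for w in vectors if w not in banned),
                  key=lambda w: -cosine(target, vectors[w]))

# With the default exclusion, the prompt words can never be the answer,
# even when one of them ranks among the closest vectors to the target.
print(analogy("man", "computer_programmer", "woman"))
print(analogy("man", "computer_programmer", "woman", exclude_prompt=False))
```

Note how the unfiltered ranking keeps the prompt words, while the default filtered ranking silently drops them, which is the effect being discussed.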

ben_w | 7 years ago

One thing I’ve been tempted to research myself but never had time for: can one use that aspect of word embeddings to automatically detect and quantify prejudice?

For example, if you trained only on a corpus of circa-1950 newspapers, would «“man” - “homosexual” ~= “pervert”» or something similar? I remember from my teenage years (as late as the 90s!) that some UK politicians spoke as if they thought like that.

I also wonder what biases it could reveal in me which I am currently unaware of… and how hard it may be to accept the error exists or to improve myself once I do. There’s no way I’m flawless, after all.

teraflop | 7 years ago

> For example, if you trained only on a corpus of circa-1950 newspapers, would «“man” - “homosexual” ~= “pervert”» or something similar?

If it did, what conclusion would you be able to draw?

As far as I know, there's no theoretical justification for thinking that word vectors are guaranteed to capture meaningful semantic content. Empirically, sometimes they do; other times, the relationships are noise or garbage.

I am wholeheartedly in favor of trying to examine one's own biases, but you shouldn't trust an ad-hoc algorithm to be the arbiter of what those biases are.

pasabagi | 7 years ago

I think this is a large part of what goes on in the digital humanities, with varying degrees of success. The problem, as usual, is not that there isn't an abundance of evidence. It's simply that nobody reads sociology papers except sociologists.

panarky | 7 years ago

In this formulation, wouldn't Basilica reflect the existing biases of the organization?

      resumes of candidates
    - resumes of employees you fired
    + resumes of employees you promoted
    ---------------------------------------
    = resumes of candidates you should hire
It's a lot of hard work to reduce bias in promotions and terminations.

Basilica might reinforce that hard work when evaluating candidates.

Or you could use the techniques described in your citation to allow Basilica to help de-bias the hiring process.
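The core of that de-biasing technique, the "neutralize" step from the cited paper, removes a vector's component along the bias direction so that gender-neutral words project to zero on it. A minimal sketch, with made-up vectors standing in for real embeddings:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neutralize(v, direction):
    """Subtract v's projection onto `direction`."""
    scale = dot(v, direction) / dot(direction, direction)
    return [x - scale * d for x, d in zip(v, direction)]

# Toy vectors, for illustration only.
he, she = [1.0, 0.0, 0.2], [0.0, 1.0, 0.2]
gender = [a - b for a, b in zip(she, he)]  # the "she - he" direction

programmer = [0.6, 0.2, 0.8]               # leans toward "he" in this toy
debiased = neutralize(programmer, gender)

print(dot(programmer, gender))  # nonzero before neutralizing
print(dot(debiased, gender))    # ~0 after
```

The full method in the paper also involves an "equalize" step for word pairs like he/she, but the projection removal above is the part that would apply to resume-style features.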