It's surprising to me that there isn't a single positive comment in this thread considering how amazing this is. Sure, it has other implications, but were we really hoping to prevent computers from recognizing faces permanently?
The key innovation is an accurate, reliable method for rotating faces so they're 'looking straight at the camera' before feeding them to a deep neural network. They call this 3D photo rotation process "frontalization." Figure 1 on page 2 of the paper shows at a very high level how this is being done. Very nice!
Just a little background the paper itself doesn't provide:
The 3-d modeling and rotation builds on the work Yaniv did as part of Face.com (a face recognition startup), which was acquired by Facebook. Studied here: http://vis-www.cs.umass.edu/lfw/results.html
Also Marc'Aurelio was just hired away from Google and is a deep learning expert.
Actually, that's only one of the contributions, and I'm not so sure it's the "key innovation". Every other recent face recognition method also tries to do some kind of alignment to make faces more similar in pose/expression/lighting prior to classifying them; and of these, several also fit faces to a 3-d model to rotate to frontal (with varying quality).
Yeah, generally speaking for object detection there are three transformations to handle - scale, position, and rotation - and typically you can identify something invariant to two of these really efficiently (for example, wavelets give you a transform that is sufficient for object identification under scale and position transformations in O(N) operations).
If you'd rather have orientation identification (i.e., rotation angle) and scale in that mix, but don't care about position, the Radon transform is nice and easy to work with.
But beyond inverting 1-2 key transformations, one usually has to pay a pretty hefty computational cost which often precludes online (near real-time) use.
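As a toy illustration of this family of tricks, here's a sketch using the Fourier magnitude instead of wavelets or the Radon transform (it's the simplest to show): the magnitude spectrum of a signal is unchanged by circular shifts, so it serves as a position-invariant signature.

```python
import cmath

def dft_mag(x):
    """Magnitude spectrum via a direct DFT (O(N^2) here; an FFT does it in O(N log N))."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)))
            for k in range(N)]

signal = [0, 1, 3, 1, 0, 0, 0, 0]  # a small "pattern"
shifted = signal[3:] + signal[:3]  # same pattern, circularly shifted in position

# The signatures match, so the pattern is identified regardless of its position.
assert all(abs(a - b) < 1e-9 for a, b in zip(dft_mag(signal), dft_mag(shifted)))
```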
Actually, based on my reading of the paper, it seems that they learn a representation using one data set (the one with 1,000 labeled samples per identity), then use that representation to classify on other data sets (like Labeled Faces in the Wild, which has 13,233 images of 5,749 people). In fact, from what I can tell from section 5.1, they seemed to use face pairs (and so trained on 1 sample per person, and then tested on the other sample).
tl;dr: They don't need 1,000 labeled samples per identity (once done with the representation phase), and they achieved 97.25% accuracy on ~6,000 distinct identities, with only one training photo per identity.
Actually, no. For one thing, this isn't exactly a stealthy or cheap thing to do: it involves datacenters full of computing resources even for 4,000 identities.
I also don't believe it's so much a privacy issue. If I upload pictures to Facebook, I actually want them to be seen by human beings; the face recognition only helps with that.
If Facebook recognizes me in a picture someplace else, I'd actually rather know about it. I'm not super famous, so unexpected pictures of myself are more likely to be a bad thing...
I do. Assuming that Facebook uses it in what appears to be the logical choice (auto-tagging photos), this would be a fantastic way to find photos of me that I don't already know about.
Having worked on this problem before (the comparison to human performance they cite is from my work) and seeing all the recent successes of deep learning, I'd bet that a lot of the gain here comes from what deep learning generally provides: being able to leverage huge amounts of outside data in a much higher-capacity learning model.
Let me try to break this down:
In machine learning, when you have input data that is labeled with the kinds of things you are directly trying to classify, that is called "supervised". In this case it's not quite supervised, because their main evaluations are on the LFW dataset, which is a verification dataset, whereas their training on SFC is a recognition task. The difference is that in verification, you are given photos of two people you've never seen before and have to identify if they're the same or not. In recognition, you are given one or more photos of several people as training data, and asked to identify a new face as one of them. In theory, you could build recognition out of verification (verify all pairs between training images and test input images and assign the top-scored name as the person) but in practice it's much better to build dedicated recognition classifiers for each person.
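To make the verification-to-recognition construction concrete, here's a toy sketch (not the paper's method) with a made-up pairwise scorer standing in for a learned verifier:

```python
# Toy sketch: building recognition out of verification by scoring the test face
# against every gallery image and taking the best-scoring name.

def verify(a, b):
    """Hypothetical verification score: higher means 'more likely the same person'.
    Here it's just a dummy similarity on 1-d toy features."""
    return -abs(a - b)

def recognize(test_face, gallery):
    """Name the test face by its best verification match across the gallery."""
    return max(gallery,
               key=lambda name: max(verify(test_face, img) for img in gallery[name]))

# Training images per person (toy 1-d features instead of real face descriptors).
gallery = {"alice": [1.0, 1.1], "bob": [5.0, 5.2]}
assert recognize(1.05, gallery) == "alice"
assert recognize(4.9, gallery) == "bob"
```

In practice, as noted above, dedicated per-person classifiers beat this all-pairs construction, but it shows why the two tasks are related.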
Their main network is trained on a recognition task, using their SFC dataset. They show these recognition results in Table 1 and the middle column of Table 2. An error rate of 8.74% (DF-4.4M), for example, means that they were able to successfully name the person in 91.26% of input images. However, this error rate crucially depends on two key factors: (1) the number of people they're trying to distinguish between, and (2) the number of images they have per person. For this test, those were ~4,000 people and ~1,000 images/person, respectively.
If you were to add more people to the database, or have fewer images per person, this accuracy would drop. You can see this clearly in Table 1, where subsets DF-3.3M and DF-1.5M have correspondingly lower error rates because they have fewer people (3,000 and 1,500, respectively). Similarly, the middle column of that table shows how error rates rise when you reduce the number of images per person.
In contrast, all subsequent results are shown on verification benchmarks (LFW and YouTube Faces). In large part, I suspect this is because of the realities of publishing in the academic face recognition literature: you have to evaluate on some dataset the community is familiar with to get your paper accepted, LFW is the de facto standard these days, and it only does verification, not recognition.
Here, their performance is certainly very good, and an improvement over previous work, but not an unexpectedly huge leap. If you look at the LFW results page, you can see that recent papers have been edging up to this number quite steadily: 95.17% (high-dim LBP), 96.33% (TL Joint Bayesian), 97.25% (this paper) http://vis-www.cs.umass.edu/lfw/results.html
So how are they able to get this boost in performance? What recent papers in this field have increasingly been finding is that higher-dimensional features can give you a big boost; or, to put it another way, a higher-capacity model is what buys you the additional performance.
In machine learning, the "capacity" of a model refers (in a loose sense) to how powerful it is. The basic tradeoff is that a higher-capacity learner can more accurately classify testing data BUT it requires much more training data to learn. The problem is that for the LFW benchmark, the amount of direct training data you have is strictly limited: there are 6,000 pairs of faces, and you train on 90% of them and test on the remaining 10%. This is not nearly enough data to train a high-capacity model.
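For concreteness, a toy sketch of that evaluation protocol, assuming the standard arrangement of 10 folds (train on 9 folds, test on the held-out one, rotate):

```python
# Toy sketch of the LFW-style evaluation split described above: 6,000 labeled
# pairs, training on 90% and testing on the held-out 10% each round.
pairs = list(range(6000))                  # stand-ins for the labeled face pairs
folds = [pairs[i::10] for i in range(10)]  # 10 folds of 600 pairs each

for held_out in range(10):
    test = folds[held_out]
    train = [p for i, fold in enumerate(folds) if i != held_out for p in fold]
    assert len(test) == 600 and len(train) == 5400  # the 10% / 90% split
```

5,400 training pairs is the entire "direct" training budget, which is why a high-capacity model can't be learned from LFW alone.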
So what people have been doing is training the bulk of their models on some other data, for some other task, and then adapting that model to the LFW problem, using the LFW training data essentially to "tweak" the classification model for this particular task. That's why the LFW results tables are now broken up into different sections according to how much outside data was used and in what form.
In the case of DeepFace, this takes the form of the SFC dataset and learning a network for recognition, not verification. Since they have access to lots of data of this form, they can successfully train a high-capacity model for it. Then they simply "chop off" the last layer of the network -- the one that does the final recognition task -- and replace it with a component for verification using only LFW training data. Or, for their "unsupervised" results, using no LFW training data ("unsupervised" in quotes because it's not really unsupervised).
BTW, this approach of training a deep network for some task, and then cutting off the last layer to apply it to a different task (in effect making it simply a feature-extraction method) is quite common, and has been applied successfully to many problems that might not have enough data to train a high-capacity model directly.
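A minimal sketch of the chop-off idea, with a made-up two-layer network (nothing like the actual DeepFace architecture): the full network maps an input to task-specific outputs, while the "chopped" version stops at the hidden layer and returns its activations as a feature vector.

```python
# Toy two-layer network; weights here are arbitrary stand-ins for learned ones.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

W_hidden = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]  # learned on the original task
W_output = [[1.0, -1.0, 0.5]]                      # task-specific layer, to be discarded

def full_network(x):
    """Original network: input -> hidden -> task-specific output."""
    return matvec(W_output, relu(matvec(W_hidden, x)))

def extract_features(x):
    """Same network with the last layer chopped off: hidden activations as features."""
    return relu(matvec(W_hidden, x))

feats = extract_features([1.0, 2.0])
assert len(feats) == 3  # the hidden layer now acts as a 3-d face descriptor
```

A new, lightweight classifier (e.g., the verification component trained on LFW pairs) is then fit on top of these features.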
Anyway, if people have more questions, I can try and answer them. (I'm not one of the authors, but I am in the field.)
Thanks for the write-up. This is very informative.
Could you elaborate a bit more on the "capacity" of learning models? Can it be quantified, and is it somehow related to the VC dimension of a particular learning problem? It would be great if you could give some examples of "capacity" for the more well-known models: trees, naive Bayes, SVMs, one-hidden-layer neural nets, etc.
Maybe a stupid question: the Social Face Classification (SFC) dataset that they refer to - is it published to the world? I wonder if they could deduce "emotions" from the SFC dataset and use it as a training set for images in the wild.
Reducing error by 25% from the prior state of the art of 96.33% gives you their stated 97.25% accuracy - about 0.9 percentage points fewer errors in absolute terms. Still amazing, but less impressive than the abstract makes it sound.
Actually, a relative 25% is the right way to judge this improvement. For example, say performance is currently at 99.9% and you improve it to 99.99%. That's not a 0.09% improvement (99.99 - 99.9), but rather a ten-fold improvement (0.01% errors vs 0.1% errors).
This is because accuracy and error don't scale linearly: the closer accuracy gets to 100%, the more each remaining error matters.
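The arithmetic is easy to check:

```python
def relative_error_reduction(acc_old, acc_new):
    """Fraction of the remaining errors that were eliminated."""
    err_old, err_new = 1 - acc_old, 1 - acc_new
    return (err_old - err_new) / err_old

# 96.33% -> 97.25%: errors drop from 3.67% to 2.75%, i.e. roughly a quarter of them.
assert 0.24 < relative_error_reduction(0.9633, 0.9725) < 0.26

# 99.9% -> 99.99%: ten-fold fewer errors, though accuracy moved only 0.09 points.
assert abs((1 - 0.999) / (1 - 0.9999) - 10) < 1e-6
```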
apu | 12 years ago:
See my other comment for my guess on what's actually providing the boost: https://news.ycombinator.com/item?id=7393378
fchollet | 12 years ago:
- they still need 1000 labeled samples per identity
- their network can only handle 4000 distinct identities (at 97.25% accuracy) at a time
It's still a very worrying development for online and offline privacy.
chriskanan | 12 years ago:
Both methods use deep neural networks, but have a lot of differences, e.g., the Fan et al. paper doesn't use a 3D face model.