Facebook crawls every page recorded by its tracking pixel
I find this highly concerning since:
1. they are crawling potentially sensitive information granted by links with tokens
2. they are triggering potentially harmful and/or confusing actions in your website by repeating links
3. they are repeating requests in a broken way by not encoding URL parameters correctly; for instance, a URL-encoded %2B is sent as a literal "+", which then gets decoded as whitespace (the same goes for slashes etc.)
4. I could not find a warning or note on their tracking-pixel documentation that pages tracked would be crawled later
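To illustrate point 3, here is a minimal Python sketch of how a token can get corrupted when a URL is replayed without re-encoding. The token value and parameter name are made up for the example, and this only shows the general decoding behaviour, not what Facebook's crawler actually does internally.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical token containing characters that are special in a query string.
token = "a+b/c"

# Correct: percent-encode the value when building the query string.
good_query = urlencode({"token": token})      # 'token=a%2Bb%2Fc'

# Broken replay: the decoded value is pasted back in without re-encoding.
bad_query = "token=" + token                  # 'token=a+b/c'

# The server decodes both query strings the same way.
decoded_good = parse_qs(good_query)["token"][0]
decoded_bad = parse_qs(bad_query)["token"][0]

print(decoded_good)  # a+b/c
print(decoded_bad)   # a b/c  (the raw "+" was decoded as a space)
```

A server that compares the decoded token against the stored one will therefore reject the replayed request, or worse, act on a different value than the one it issued.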
[+] [-] K0nserv|8 years ago|reply
Don't put Facebook tracking on sensitive pages. Actually as a service to your users don't put it anywhere where it doesn't add value.
> 2. they are triggering potentially harmful and/or confusing actions in your website by repeating links
They only perform idempotent[0]* requests which should not have any negative effect if performed multiple times
0: http://restcookbook.com/HTTP%20Methods/idempotency/
* They probably only actually perform GET in reality
[+] [-] cup-of-tea|8 years ago|reply
So don't put it anywhere.
[+] [-] d33|8 years ago|reply
[+] [-] naiveai|8 years ago|reply
> Again, this only applies to the result, not the resource itself. This still can be manipulated (like an update-timestamp, provided this information is not shared in the (current) resource representation.
This means that tracking could still potentially affect some stuff, but honestly not by much.
[+] [-] pmlnr|8 years ago|reply
Don't put tracking on sensitive pages.
[+] [-] detaro|8 years ago|reply
Do your users, your broken software and yourself a favor and don't put Facebook tracking crap everywhere.
[+] [-] radicalbyte|8 years ago|reply
[+] [-] Quppa|8 years ago|reply
There's a bug report regarding the missing header here: https://developers.facebook.com/bugs/1654459311255613/
Unfortunately it seems impossible to get in touch with Facebook devs directly.
[+] [-] ikeboy|8 years ago|reply
[+] [-] AznHisoka|8 years ago|reply
[+] [-] gnud|8 years ago|reply
Now, if the crawler doesn't honor robots.txt, then you can complain (loudly).
[+] [-] slig|8 years ago|reply
Not their fault. GET requests should not modify anything.
[+] [-] throwaway2016a|8 years ago|reply
A summary of what most people are saying, including some takeaways:
- If you put something on the Internet it is public. Period. It is up to you to keep prying eyes away from that page. You can do that with strong mechanisms (like passwords and firewalls) or weak (like robots.txt) but you need to do something. You can't expect a page on the Internet to be private.
- Requests should never ever have anything sensitive in the query string. The query string is inherently logged. By your browser history, your web server, any tracking pixels like Facebook you put on the page, etc. If you absolutely must include a token in the URL (like with OAuth) make sure it is a temporary token and is immediately replaced with something more durable like a cookie or local storage, no unnecessary HTML is rendered, and the user is redirected to a new page that doesn't have it in the URL.
- GET requests should be idempotent. They should avoid changing any data as much as possible and should not have side effects. This is specified directly in the HTTP spec.
- If your page displays sensitive data it should require the security tokens in a header field (like cookies or authentication). Users who hit the page without that header field should be responded to with a 404.
- Your point #3 is an odd one. It is a bug on the Facebook side, yes, but it doesn't support your primary argument. In fact, if they fixed that bug it would make the perceived issues in your primary argument worse.
- Re #4 they don't need to warn you. See the first bullet. If it is on the internet it is public. Skype, Slack, Twitter, Google, all do the same thing.
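The second and fourth bullets can be sketched roughly as follows. This is a framework-free Python illustration under stated assumptions: `ONE_TIME_TOKENS`, `SESSIONS` and `handle_get` are hypothetical names, the stores are in-memory dictionaries, and a real site would use its framework's session handling and token expiry.

```python
import secrets
from urllib.parse import parse_qs, urlsplit

# Hypothetical in-memory stores; a real site would use a database with expiry.
ONE_TIME_TOKENS = {"tok123": "alice"}
SESSIONS = {}

def handle_get(url, cookies):
    """Return (status, headers, body) for a GET request."""
    parts = urlsplit(url)
    token = parse_qs(parts.query).get("token", [None])[0]

    if token is not None:
        # The URL token is single use: pop() removes it on first sight.
        user = ONE_TIME_TOKENS.pop(token, None)
        if user is None:
            return 404, {}, ""
        session_id = secrets.token_urlsafe(32)
        SESSIONS[session_id] = user
        headers = {
            # Redirect to the same path without the token in the URL.
            "Location": parts.path,
            "Set-Cookie": f"session={session_id}; HttpOnly; Secure; SameSite=Lax",
        }
        return 303, headers, ""  # no HTML rendered on the token URL

    # No token in the URL: only a valid session cookie grants access.
    user = SESSIONS.get(cookies.get("session"))
    if user is None:
        return 404, {}, ""
    return 200, {}, f"account page for {user}"
```

With this shape, a crawler that later replays the recorded URL (token included) gets a 404, because the token was consumed on first use and the crawler has no session cookie.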
[+] [-] Artemix|8 years ago|reply
Best solution is still to block Facebook's infrastructures, as always.
[+] [-] xstartup|8 years ago|reply
[+] [-] dotdi|8 years ago|reply
Abuse of power and shady tracking techniques by Facebook? Unheard of! </rant>
Seriously, this cannot be surprising after learning that the Messenger app listens to everything you do, all the time. That's just off the top of my head. They are doing this and much more.
[+] [-] threeseed|8 years ago|reply
Can you provide some evidence of this happening on Android?
Also Facebook categorically denies this: http://www.bbc.com/news/technology-41776215
[+] [-] rocqua|8 years ago|reply
However, I've never seen a non-anecdotal source or even a source that gathers all anecdotes and gives a decent meta-analysis. Would you happen to have one?
[+] [-] rock_hard|8 years ago|reply
https://www.wired.com/story/facebooks-listening-smartphone-m...
[+] [-] VMG|8 years ago|reply
I would be surprised if a service didn't index my site after I put one of their pixels on it.
[+] [-] agopaul|8 years ago|reply
Also, the same crawler ignores the "User-agent: *" directive in the robots.txt file and you have to add specific rules for it: "User-agent: Adsbot-Google"
[+] [-] unicornporn|8 years ago|reply
Not surprising at all. Would be interesting to see a write up on this.
[+] [-] dspillett|8 years ago|reply
I would be more surprised to find out that they didn't crawl everything they can, specifically pages that invite them in.
> 1. they are crawling potentially sensitive information granted by links with tokens
If the page contains sensitive information you absolutely should not have code on it that you do not control (any code loaded from third party hosts, not just facebook's bits).
As a matter of security due diligence if you have third party hosted code linked into any such pages you should remove it with some urgency and carefully review the design decisions that led to the situation. If you really must have the third party code in that area then you'll need to find a way of removing the need for the tokens being present.
Furthermore, if the information is sensitive to a particular user then your session management should not permit a request from facebook (or any other entity that has not correctly followed your authentication procedure) to see the content anyway.
> 2. they are triggering potentially harmful and/or confusing actions in your website by repeating links
Possibly true, but again that suggests a design flaw in the page in question. I assume that they are not sending POST or PUT requests? GET and HEAD requests should at very least be idempotent (so repeated calls are not a problem) and ideally lack any lasting side effect (with the exception of logging).
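One common way to keep token links safe to repeat is to make the GET only render a confirmation form and perform the actual change on POST. A minimal sketch, assuming a hypothetical unsubscribe handler with an in-memory subscriber set (`SUBSCRIBED` and `handle` are illustrative names):

```python
# Hypothetical subscriber store; a real site would use a database.
SUBSCRIBED = {"alice@example.com"}

def handle(method, email):
    """Return (status, body). GET and HEAD never change state."""
    if method in ("GET", "HEAD"):
        # Safe: only render a confirmation form, however often it is requested.
        return 200, '<form method="post"><button>Confirm unsubscribe</button></form>'
    if method == "POST":
        # The state change happens only on an explicit POST.
        # discard() is itself idempotent: repeating it changes nothing further.
        SUBSCRIBED.discard(email)
        return 200, "You have been unsubscribed."
    return 405, "Method Not Allowed"

# A crawler replaying the link any number of times leaves the data untouched.
for _ in range(3):
    handle("GET", "alice@example.com")
print("alice@example.com" in SUBSCRIBED)  # True
```

The cost is one extra click for the user, which is usually acceptable for actions like unsubscribing or confirming a password reset.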
> 3. they are repeating requests in a broken way by not encoding url-parameters correctly
That does sound like a flaw, but one that your code should be immune to being broken by. Inputs should always be verified and action not taken unless they are valid. This is standard practise for good security and stability. The Internet is a public place, the public includes both deliberately nasty people and damagingly stupid ones so your code needs to take proper measures to not allow malformed inputs to cause problems.
You can't use "the page isn't normally linked from other sources so won't normally be found by a crawler" as a valid mitigation because the page could potentially be found by a malicious entity via URL fuzzing.
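As a rough sketch of the "verify inputs before acting" advice, a token check might look like this in Python. The expected token shape (43 URL-safe characters, which is what `secrets.token_urlsafe(32)` produces) is an assumption for the example, and `is_valid_token` is a hypothetical helper name.

```python
import hmac
import re
import secrets

# Assumed token shape: 43 URL-safe base64 characters, as produced by
# secrets.token_urlsafe(32). Adjust to whatever your site actually issues.
TOKEN_RE = re.compile(r"[A-Za-z0-9_-]{43}")

def is_valid_token(candidate, expected):
    """Reject malformed input first, then compare in constant time."""
    if not isinstance(candidate, str) or TOKEN_RE.fullmatch(candidate) is None:
        return False
    return hmac.compare_digest(candidate.encode(), expected.encode())

expected = secrets.token_urlsafe(32)
print(is_valid_token(expected, expected))              # True
print(is_valid_token(expected[:-1] + " ", expected))   # False (mangled character)
print(is_valid_token(None, expected))                  # False (missing parameter)
```

A request whose token fails this check should simply be refused, so a badly re-encoded or fuzzed URL cannot trigger any action.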
> 4. I could not find a warning or note on their tracking-pixel documentation that pages tracked would be crawled later
A warning would be nice, but again, unless they explicitly say they won't do such things, I would be surprised to find that they didn't, not that they do.
[+] [-] eli|8 years ago|reply
[+] [-] dna_polymerase|8 years ago|reply
> 1. they are crawling potentially sensitive information granted by links with tokens
If tokens in GET params are your security concept: please leave the entire field.
> 2. they are triggering potentially harmful and/or confusing actions in your website by repeating links
So you built something that can be triggered by a simple HTTP request and may have a harmful potential? Wow.
> 3. they are repeating requests in a broken way by not encoding url-parameters correctly
You are kidding right? That's a problem to you? Either your Webserver drops these or your routes don't match, end of story.
> 4. I could not find a warning or note on their tracking-pixel documentation that pages tracked would be crawled later
Not a problem, you put it on the web and it will be crawled. Did you ever use Chrome? They report every URL you type to the Google Crawler. Read that anywhere lately?
[+] [-] rurounijones|8 years ago|reply
[+] [-] MattBearman|8 years ago|reply
[+] [-] kiloreux|8 years ago|reply
[+] [-] detritus|8 years ago|reply
Do you have a source for this? I Googled (!) and found this: https://www.stonetemple.com/google-chrome-discover-pages , which implies the opposite.
I don't use Chrome personally, but I do occasionally dump [none-too critical] preview files on open but otherwise 'hidden' urls on a domain for clients to view. I just find it easier for clients to deal with than inevitably lost passwords, etc, and tend to ask them to let me know when they're done so I can delete the folder.
I'd be interested to know whether their likely use of Chrome means that Google has a pattern of understanding of my domain space!
[+] [-] greenone|8 years ago|reply
- marketing wants some tracking, some developer adds it
- ecommerce websites in the real world tend to "need" these tracking/conversion codes
- you do have legitimate get-requests like password-reset links with tokens, also we do use payment providers who send the customers back to us with get links which include payment tokens, newsletter-unsubscribe links are also often simple token links
- and yes normally a get-request should not change anything (at least not when it's just repeated) but the sheer fact that they have access to it _and_ are crawling it is bad
my point being that I find it concerning that they would just crawl everything they recorded, instead of just crawling pages which are linked publicly or which are targeted in ad-campaigns, combined with the fact that they don't warn you about it
[+] [-] hoppelhase|8 years ago|reply
[+] [-] boraturan|8 years ago|reply
[+] [-] gaius|8 years ago|reply
[+] [-] zerostar07|8 years ago|reply
[+] [-] Angostura|8 years ago|reply
[+] [-] receptor|8 years ago|reply
[+] [-] dspillett|8 years ago|reply
While two wrongs don't make a right (assuming we accept that facebook is wrong in this instance, which I don't think I do), the code for the page handing out sensitive information to an unauthenticated request or taking action based on malformed inputs is negligent.
"Information wants to be free" is not just a hippie ideal it is a technical warning. Unless you take proper measures to control and protect sensitive data it will find a way out.
[+] [-] threeseed|8 years ago|reply
Just add a robots file or block the user agent with your firewall.
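A robots file along those lines can be checked locally with Python's standard `urllib.robotparser`. The user-agent name `facebookexternalhit` and the paths are assumptions for this sketch (check your own logs for the exact user agent hitting your site), and robots.txt is only advisory: it helps only if the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut one named crawler out entirely and
# keep everyone else away from the account pages.
ROBOTS_TXT = """\
User-agent: facebookexternalhit
Disallow: /

User-agent: *
Disallow: /account/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("facebookexternalhit/1.1", "/products/1"))  # False
print(parser.can_fetch("SomeOtherBot", "/products/1"))             # True
print(parser.can_fetch("SomeOtherBot", "/account/reset"))          # False
```

For anything actually sensitive, blocking at the firewall or requiring authentication is the stronger option, since it does not depend on the crawler's cooperation.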
[+] [-] smt88|8 years ago|reply
Where do they draw the line? Why not run a keylogger through embedded like buttons and widgets? That sounds worse, but isn't all that much worse.