What makes this extra ridiculous is the fact that LinkedIn built its business on scraping not publicly available information but the private address books of unsuspecting users.
This is incredibly important. If you dig into why LinkedIn is behaving this way, it is definitely not an attempt to protect users' privacy. It's all about maintaining and expanding the ways it can monetize the data that users provide.
This is the type of thing we risk losing as the internet matures and internet companies with vested interests gain more power. Setting this type of precedent will absolutely curtail innovation and freedom in the future. Think about it: would Google have been created in an environment that is overwhelmingly siloed and filled with red tape?
I see parallels to the net neutrality discussion in this.
Access that does not require authentication should never be a crime. If LinkedIn wants the courts to intervene, they must require authentication for their data. If they also want Google to scrape their site, they must require Googlebot to authenticate itself.
> Access that does not require authentication should never be a crime.
Careful, this could legitimize things like accidental denial of service. Depending on circumstances, even basic scraping could cause problems.
(I need to be vague to avoid violating an NDA.) A major internet site had a URL that went something like somedomain/group?id=xxxxx. It turns out that a simple scraper that requested id=1, id=2, id=3, etc., etc., caused a major problem! Rendering these pages required significant resources, so the most active pages were kept in RAM. Of course, the scraper tried to read everything.
Of course, no one thought the scraper was malicious in any way!
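The difference between the scraper described above and a harmless one is often just pacing. A minimal sketch of throttling an enumeration crawl (the URL pattern and `fetch` callable are stand-ins for the commenter's hypothetical site):

```python
import time

def polite_fetch(fetch, urls, delay=1.0):
    """Call fetch(url) for each url, sleeping `delay` seconds between
    requests so a simple enumeration crawl does not hammer the server."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to sleep before the first request
            time.sleep(delay)
        results.append(fetch(url))
    return results

# Hypothetical usage against the somedomain/group?id=N pattern above:
pages = polite_fetch(lambda u: "page:" + u,
                     ["/group?id=1", "/group?id=2"], delay=0)
```

Even a one-second delay turns "read everything at once" into a crawl the cache can absorb; the naive version in the anecdote had no delay at all.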
It seems AdWeek can distinguish a "good bot" from a "bad bot" irrespective of the behavior of the user^W bot, i.e., whether it is one single HTTP request or 10,000 consecutive requests is irrelevant.
How do they do it?
Pattern match against the User-Agent string.
Effective shibboleth.^W engineering.
Clarification: If a user, not a "bot", makes the "wrong" choice of user-agent string (e.g. in the browser settings), then they will be labeled a "bad bot", even if their behavior is no different than other users who are not labeled "bad bots". For example, they make one HTTP GET request just like any other user. There are databases of "acceptable" user-agent strings available to anyone. If still unsure about the point I am making, see this post from several days ago: https://www.sigbus.info/software-compatibility-and-our-own-u...
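The classification scheme being criticized fits in a few lines; the string lists below are invented stand-ins for the real "acceptable user-agent" databases:

```python
# Sketch of user-agent-only "bot" classification. Behavior (request
# count, rate) never enters into it; only the string does.
KNOWN_GOOD_BOTS = ("googlebot", "bingbot")
KNOWN_BROWSERS = ("mozilla", "chrome", "safari")

def classify(user_agent: str) -> str:
    ua = user_agent.lower()
    if any(bot in ua for bot in KNOWN_GOOD_BOTS):
        return "good bot"
    if any(b in ua for b in KNOWN_BROWSERS):
        return "human"
    return "bad bot"  # one GET with the "wrong" string is enough
```

Under this scheme, `classify("curl/8.0")` is a "bad bot" after a single request, while anything claiming to be Googlebot passes, which is exactly the shibboleth the comment describes.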
Hm, I disagree. Either information is public, no matter for whom, or the information is private, and you should have an ACL for accessing it. I don't think it's fair to say that information is public if you're a human but private if you're a machine, or vice versa.
It's not about whether it's difficult to build, but the principle of whether you can allow only humans to read something.
For the most part I agree, but I feel there are grey areas. Things like web browsers (which are not robots) access content on behalf of a human. But what about extensions or apps that do things in the background, such as caching the contents of several pages for offline viewing? Is that now considered a bot?
The robotstxt.org site states that a robot "should" obey the rules. "Should" is not a legal term that implies compliance; "must" would have been more appropriate to indicate enforcement.
That file includes at least two non-standard syntax extensions[0]. The robots exclusion protocol is just a de facto standard, and respect for its directives varies[1]. So much for compliance being 'not difficult' when the task isn't even clear, because there isn't even a clear standard.
Archive.org also dislikes how robots.txt is used mainly for search engines and says it works against their mission in particular[2]. Are they now hackers for not throwing away information just because someone was overzealous with robots.txt, or retired a website that now uses robots.txt as SEO to let another one take its place in Google search results?
If some big corp wants to cry and bring legal matters into software, they should first be accountable for failing to secure themselves and their clients' data (see the LinkedIn hack people mentioned elsewhere here, and high-profile hacks like Equifax, Sony, etc.). Or should software shape up to be like many other areas today, where multi-million-dollar corporations are free to play fast and loose and endanger people while the small guys get fried over meaningless bullshit and vaguely defined "crimes"?
[0] - https://en.wikipedia.org/wiki/Robots_exclusion_standard#Nons...
[1] - https://intoli.com/blog/analyzing-one-million-robots-txt-fil...
[2] - https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
> robots.txt compliance is not difficult to build. I'm fine with robots.txt violations being considered hacking.
I'm not. You can set up a server to serve different versions of robots.txt to different folks. A malicious actor could deliberately feed inputs to a specific crawler that convince it to violate the terms of the robots.txt it serves to everyone else, and then press for criminal charges against the operator of the scraper.
In a sufficiently adversarial relationship, this lets website owners turn any well-behaved site scraper into criminal activity. That's not a power we want to grant.
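The cloaking attack described above takes very little server-side code. A sketch, with a hypothetical crawler name (`targetbot`) standing in for the victim:

```python
# The server hands a permissive robots.txt to one targeted crawler and a
# restrictive one to everyone else; the target then appears to "violate"
# the file that everyone else sees.
def robots_txt_for(user_agent: str) -> str:
    if "targetbot" in user_agent.lower():  # the crawler being set up
        return "User-agent: *\nAllow: /\n"
    return "User-agent: *\nDisallow: /\n"
```

Since the response can vary per request, no outside observer can prove which robots.txt the scraper actually received, which is what makes the trap workable.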
> I'm fine with robots.txt violations being considered hacking.
Okay. Start with something simple then: how would you define a "bot", and thus what is subject to your robots.txt rule?
Is my web browser a bot? What about a proxy? What about a deaf person's screen reader?
If my web-browser pre-fetches links near my mouse pointer, is that a bot? What if it downloads the whole of an article split over, say, ten pages?
I think of robots.txt as similar to posting a "No Trespassing" sign. For a private residence, it's almost not even required; yet for something like a shopping mall during opening hours, the default assumption is that anyone is allowed to be there without a specific invitation, until they are expressly asked to leave and not come back.
Trying to nail down the exact line is a tough issue.
I don't know. The pathological case could include a rapidly changing robots.txt. Think about archive.org's policy: if they suddenly find new restrictions on a domain, they hide it in their Wayback Machine. Sometimes an old site will go down and be replaced by totally new owners, which retroactively breaks some domains in the Wayback Machine.
Honoring the robots.txt file is voluntary and ignoring it should in no way be considered hacking. I would go so far as to say that any activity that someone could engage in, simply by loading a URL, should in no way be considered hacking.
Not only does it make it way too easy to prosecute software developers, it really devalues the term "hacking".
I was about to agree that robots.txt prohibitions should be treated as a withdrawal of authorization.
But I think what is being argued is that "if it's publicly available at a URL, it's available for any client to download and use." I think the latter argument holds more water: since they are making it publicly available, there is implicit authorization.
If robots.txt allows Google and Bing but nobody else, it should be ignored. If it blocks everyone, then I agree. We need to make sure that the next Google has a chance to succeed.
Interesting. They say that crawling is prohibited there, actually, and have a blanket 'Disallow' at the end.
# Notice: The use of robots or other automated means to access LinkedIn without
# the express permission of LinkedIn is strictly prohibited.
...
User-agent: *
Disallow: /
All the listed bots are only able to access a small subset of pages, the same for each bot apart from one. The 'deepcrawl' bot is privileged, and gets to see the '/profinder' pages, for some reason?
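For what it's worth, the standard-library robots.txt parser reproduces this per-agent whitelisting. A sketch against a trimmed, hypothetical file in the same shape as LinkedIn's (the paths and agent names are illustrative, not the real file):

```python
import urllib.robotparser

# A cut-down file: one whitelisted bot gets a subset, everyone else gets
# a blanket Disallow, matching the structure described above.
robots = """\
User-agent: Googlebot
Allow: /in/
Disallow: /

User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/in/someone"))
print(rp.can_fetch("curl", "https://example.com/in/someone"))
```

The whitelisted agent matches its own record and the `Allow` wins; any other agent falls through to `User-agent: *` and is refused everything.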
How about: if you want me not to scrape it, keep it off my internet??
Actually, I'm considering building an "API-fication" of websites, with bindings for major languages (Java, Python, JS). With luck, websites could participate by providing and maintaining a parseable API-sitemap.
This would open the door to my second project: orchestration a la BPEL on top of websites, with a visual editor, macros, and scripting. Call it PIPES 2.0.
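To make the idea concrete: one minimal shape an "API-sitemap" binding could take. Every detail here, the format, the field names, and the example site, is invented for illustration; the commenter's actual design is unspecified.

```python
import json

# Hypothetical "API-sitemap": a site publishes a machine-readable map
# from endpoint names to URL templates and field selectors.
SITEMAP = json.loads("""
{
  "profile": {
    "url": "https://example.com/people/{slug}",
    "fields": {"name": "h1.name", "title": "p.headline"}
  }
}
""")

def bind(sitemap):
    """Turn each sitemap entry into a callable returning the concrete URL
    and the selectors a client-side parser would apply to the page."""
    def make(entry):
        return lambda **params: {
            "url": entry["url"].format(**params),
            "fields": entry["fields"],
        }
    return {name: make(entry) for name, entry in sitemap.items()}

api = bind(SITEMAP)
print(api["profile"](slug="jane-doe")["url"])
```

A site maintaining such a map would let scrapers consume stable endpoints instead of brittle HTML, which is the cooperation the comment hopes for.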
I mentioned this before in a previous thread on this topic, but I can't support the EFF on this. This is, in the end, an argument against control over one's own data: LinkedIn might be doing sketchy things with your data, but it's all stuff you voluntarily agreed to in exchange for their service. If any shady data aggregator can vacuum it up and do whatever, I didn't consent to that and I'm not getting any benefit from it. The EFF shouldn't be defending that right.
But the EFF isn't arguing that any shady aggregator should be able to vacuum up anything. LinkedIn would still have the full right and ability to implement limits, blocks, or so on to prevent this. LinkedIn could still make it against their terms of service and pursue a civil suit. It just would stop LinkedIn from being able to pursue felony hacking prosecutions against people for accessing a public webpage with a script.
For real: I really hate corporations 'stealing' data from my phone. For example, Google likes to introduce new sync options to Android, and every time they do so, the option is activated by default. So as soon as the update arrives, their software syncs my data to their servers without my consent. They probably have some clause in the EULA, but as a user of their products I really hate that behavior. A similar case is not being able to disable address book sync before it syncs for the first time.
Those things should be crimes as the data they fetch is not publicly available on some web page but exists only on my personal device and they take it without my consent.
How does a website put reasonable limits on access?
I'm not saying what LinkedIn is trying to do is right, but it seems to me there needs to be a way to say "Dude, that's not cool." A regular B&M store can refuse service to disruptive people and trespass those who don't comply; why not servers?
--edit--
Pretty much what rayiner is saying, they posted while I was typing.
> How does a website put reasonable limits on access?
1) Blocking TCP connections
2) Returning a 4XX error, perhaps even "401 Unauthorized", "402 Payment Required", "403 Forbidden", or "429 Too Many Requests"
> A regular B&M store can refuse service to disruptive people and trespass people who don't comply, why not servers?
A brick-and-mortar store has to _tell_ you you're being banned. The mechanisms I listed above both tell you and lock the door whenever you attempt access.
Edit: In this case, it's more like someone was looking in the store window from the public sidewalk and was asked to stop. Can you really ask someone to stop looking at you from a public place?
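Option 2 above, refusing an over-limit client with an explicit status rather than silently throttling, can be sketched in a few lines (the limit and the request bookkeeping are illustrative):

```python
from http import HTTPStatus

def respond(client_requests: int, limit: int = 100):
    """Return (status, headers) for a client that has made
    `client_requests` requests in the current window. Over the limit,
    "lock the door" with an explicit 429 and a Retry-After hint."""
    if client_requests > limit:
        return (HTTPStatus.TOO_MANY_REQUESTS, {"Retry-After": "60"})
    return (HTTPStatus.OK, {})
```

The point of the explicit status is exactly the one made above: the client is told it is being refused, rather than discovering a ban by inference.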
They have many options. They can rate limit access by IP address, they can keep information they'd like not to be scraped behind login screens. And so on.
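IP-based rate limiting of the kind mentioned here is commonly done with a token bucket. A minimal sketch (the rate and capacity values are arbitrary):

```python
import time

class TokenBucket:
    """Each client may make `rate` requests per second on average,
    with bursts of up to `capacity` requests."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.stamp = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last request.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # ip -> TokenBucket

def allow_request(ip: str) -> bool:
    bucket = buckets.setdefault(ip, TokenBucket(rate=1.0, capacity=5))
    return bucket.allow()
```

With these numbers a burst of five requests succeeds and the sixth is refused until tokens refill, which is the "reasonable limit" a site can enforce without involving the courts.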
I think scraping for personal use (not honoring robots.txt) should always be legal unless you are attempting a DoS. You are accessing public information, the server is returning HTTP 200, and it doesn't matter whether you do so using a browser, PhantomJS, or curl with the -A parameter.
A different situation would be scraping a website to do business with the data. The worst case is using the data directly: those StackOverflow clones with the original data don't sound OK to me, for example. I am not sure what to think about bots doing various derived work like stats and analysis. I think that if they are part of a business, making money, it shouldn't be legal unless those requests are permitted by robots.txt.
Question: how can this principle coexist with the idea that "surveillance is bad"? Surveillance is also mostly collecting publicly available information. Is it bad because it's done by a government? It's possible to set up a bunch of privately owned cameras in a city and keep filming people. Is it the association of information that makes it bad, and not mere collection? Is it okay if there is no personally identifiable information (though who knows what one can make out of it)? I don't know what I should think of this.
This thought process always bewilders me. Whenever it comes up that government agencies monitor our emails and phone calls, someone, as if on cue, always pipes up that that's totally no different from people posting on their Facebook timeline and other absolutely mind-bogglingly bad equivalences.
You, however, go the extra mile, here. How about you explain exactly how accessing published information on a public website is like building a network of cameras to monitor a city with?
Can data that is supplied with the intention of being publicly accessible, i.e. public domain, be restricted? If the public were asked, "When you supplied your picture and your name and created a public URL to become fully searchable, was your intention that that information be restricted, or was it information you publicized about yourself so that potential employers could find you?", the answer would be, "Yes, it was 100% my intention to become searchable so that employers would be able to seek me out." Conversation over.
LinkedIn creates an implied covenant with public consent (mostly) to then publish and make discoverable their professional profiles.
While LinkedIn 100% should have the right to stop others from embedding without permission, since it's possible to claim the data structure and presentation are proprietary to them, this should never extend to the actual data itself, since that was willingly gifted by the actual owners (Joe Public) into the public domain.
I think an argument could be made that LinkedIn is being burdened with a degree of data mining that affects their business, and therefore should be able to charge a minimal fee, e.g. an API firehose to acquire the data in bulk as a raw data stream.
That seems reasonable, depending on the charges associated with that offer; this would be the correct compromise, since their data structure is all that actually separates their service from, say, About.me or any other site of that type, none of which disallow scraping, as long as it doesn't amount to a DoS attack (of course).
Anyway, my comments are those of a marketer, not a programmer or lawyer, but personally I'm very interested to see this case resolved in a manner that doesn't suit LinkedIn in the slightest.
There is a difference between public property and private property that is made available to the public. Just because the cafe on the corner has its door open and lets you stroll in off the street doesn't mean that the property owner doesn't retain the right to exclude people. And if the property owner revokes your permission, then going onto the property again can be a crime (trespass).[1]
Servers are no different. The Internet isn't an abstraction--it's just pieces of private property connected together (servers, routers, switches). When you make an HTTP request, you're accessing a piece of private property. The owner of that property has every right to decide not to let you do so.
> LinkedIn argues that imposing criminal liability for automated access of publicly available LinkedIn data would protect the privacy interests of LinkedIn users who decide to publish their information publicly, but that’s just not true
Protect them from what, your unlocked front door? [0][1]
I'm doing QA to validate information collected by my recruiting company, both acting within LinkedIn's terms of service for a paid subscription, and violating their terms of service by improving my own company's process. Like the article said: LinkedIn wants to participate in an open internet and also abuse the CFAA.
Yes, it wasn't previously a crime for either the providers or the viewers; however, viewers republishing or exploiting public information without acknowledgment is plagiarism.
Let us remember here that Microsoft owns LinkedIn. There's been a lot of love for Microsoft here recently (I'm among the many who are liking the 'new' MS). No doubt, this is quite a separate group from those doing OSS/Linux/Python/Jupyter/etc., but it's worth pausing to think about what a move like this says about their overarching corporate strategy.
Shit like this has been LinkedIn's modus operandi since day one, not to mention their own questionable ethics. It has little or nothing to do with them now being a subsidiary of Microsoft.
mgalka | 8 years ago:
LinkedIn is not trying to prevent access. They want to prevent information from being scraped and then used to their detriment.
feelin_googley | 8 years ago:
This is an article about the LinkedIn v hiQ case at AdWeek.
ikeboy | 8 years ago:
You mean, bots that obey robots.txt?
https://www.linkedin.com/robots.txt very specifically prohibits scraping by any bot besides a small whitelist.
robots.txt compliance is not difficult to build. I'm fine with robots.txt violations being considered hacking.
bambax | 8 years ago:
Really?? That would mean private corporations, or private citizens, can write laws.
Dylan16807 | 8 years ago:
Boom. Easy to have both opinions.
I would love to limit corporate databases, but not via letting website owners declare arbitrary use to be criminal.
FilterSweep | 8 years ago:
[0] "Hackers selling 117 million LinkedIn passwords" http://money.cnn.com/2016/05/19/technology/linkedin-hack/ind...
[1] https://en.wikipedia.org/wiki/2012_LinkedIn_hack
I'd also note that these companies are barely (if ever) held liable for life-compromising hacks on their platforms.