This is great work. Forgive me if I'm missing it, but since the blog post implies you're aggregating and cleaning the data from several lists, is there any way to see the latest additions (RSS, etc.) rather than directly searching for individuals?
It would make it more useful for flagging up potential stories, as well as researching stories journalists are already writing.
disclosure: I work for a company that provides real-time data to journalists for story discovery, and I know we'd certainly be interested
Carlos, really great work, congratulations!!
I've been studying topics related to technology vs. corruption for a while here at Berkeley. I have interesting testimonies from contacts who have lived through the post-technology shift in government. Peru has a lot of potential in this area. If you need help at any point, I'd be happy to support you!
Good work! It would be interesting to cross-match the visits with other sources of information (newspapers, WikiLeaks, etc.) over a timeline to recreate someone's whole history. This would make it possible to identify patterns and their modus operandi.
It would be interesting to see the volume of visits by government office year over year. I have a feeling that periods around elections might look very different. Also would be interested to see distribution color-coded by industry. Mining and contracting should pop up for certain time periods and government agencies.
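Once the visit log is tabular, that kind of aggregation is a one-liner; a minimal sketch (the column layout and the sample rows here are made up for illustration):

```python
from collections import Counter

# Hypothetical rows: (visitor, government_office, visit_date as YYYY-MM-DD)
rows = [
    ("Ana", "Ministry of Energy and Mines", "2011-03-02"),
    ("Luis", "Ministry of Energy and Mines", "2011-07-15"),
    ("Ana", "PCM", "2012-01-09"),
]

# Count visits per (office, year) to compare volumes year over year.
by_office_year = Counter((office, date[:4]) for _, office, date in rows)
```

The same grouping key could be swapped for (industry, year) once visitors are tagged by industry, which would surface the mining and contracting spikes mentioned above.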
Is there a source of information like Peru's for Bolivia? I imagine there is a lot to discover about corruption and influence peddling in Bolivia.
Full disclosure, I work for Scrapinghub and the web UI you speak of is Portia - our open source visual web scraper. It's for those who range from non-technical to technical but want a quick way to scrape data. I think it's extremely important to develop tools to democratize the acquisition of data regardless of technical background and skill. Glad you find the article and tool interesting!
Can you draw a co-visit graph of people, i.e. who visited the building at the same times as somebody else? The strength of a connection could be visited_both^2 / ((visited_without_other_1 + 1) * (visited_without_other_2 + 1)).
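The suggested strength metric is easy to sketch; here is a toy version in Python (the (person, timeslot) log format and the sample data are assumptions for illustration):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: visit log as (person, timeslot) pairs.
visits = [
    ("alice", "2015-06-01 10:00"), ("bob", "2015-06-01 10:00"),
    ("alice", "2015-06-02 09:00"), ("bob", "2015-06-02 09:00"),
    ("carol", "2015-06-03 14:00"),
]

# Group visitors by timeslot, and count each person's total visits.
by_slot = defaultdict(set)
total = defaultdict(int)
for person, slot in visits:
    by_slot[slot].add(person)
    total[person] += 1

# Count co-visits for every pair seen in the same timeslot.
both = defaultdict(int)
for people in by_slot.values():
    for a, b in combinations(sorted(people), 2):
        both[(a, b)] += 1

# strength = visited_both^2 / ((visits_without_other_1 + 1) * (visits_without_other_2 + 1))
def strength(a, b):
    vb = both.get(tuple(sorted((a, b))), 0)
    return vb ** 2 / ((total[a] - vb + 1) * (total[b] - vb + 1))
```

With the sample data, alice and bob always visit together, so their edge dominates, while alice–carol scores zero.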
FWIW, if you live in the U.S., then you benefit from having such data in great quantity, though I don't think it's sliced-and-diced to near the potential that it has:
Lobbyists have to follow registration procedures, and their official interactions and contributions are posted to an official database that can be downloaded as bulk XML:
http://www.senate.gov/legislative/lobbyingdisc.htm#lobbyingd...
Could they lie? Sure, but in the basic analysis that I've done, they generally don't feel the need to...or rather, things that I would have thought lobbyists/causes would hide, they don't. Perhaps the consequences of getting caught (e.g. in an investigation that discovers a coverup) far outweigh the annoyance of filing the proper paperwork...having it recorded in an XML database that few people take the time to parse is probably enough obscurity for most situations.
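Parsing the bulk XML is mostly a matter of walking the tree; a sketch with Python's stdlib (the element and attribute names below are invented for illustration — check the actual schema in the Senate download):

```python
import xml.etree.ElementTree as ET

# Hypothetical structure; the real filings use their own schema.
doc = """<Filings>
  <Filing Year="2014">
    <Registrant Name="Acme Lobbying LLC"/>
    <Client Name="Widget Corp"/>
  </Filing>
</Filings>"""

root = ET.fromstring(doc)
# Pull out (registrant, client) pairs from each filing.
pairs = [
    (f.find("Registrant").get("Name"), f.find("Client").get("Name"))
    for f in root.iter("Filing")
]
```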
There's also the White House visitor database, which does have some outright admissions, but still contains valuable information if you know how to filter the columns:
https://www.whitehouse.gov/briefing-room/disclosures/visitor...
But it's also a case (as it is with most data) where having some political knowledge is almost as important as being good at data-wrangling. For example, it's trivial to discover that Rahm Emanuel had few visitors despite his key role, so you'd have to be able to notice that and then take the extra step to find out his workaround:
http://www.nytimes.com/2010/06/25/us/politics/25caribou.html
And then there are the many bespoke systems and logs you can find if you do a little research. The FDA, for example, has a calendar of FDA officials' contacts with outside people...again, it might not contain everything but it's difficult enough to parse that being able to mine it (and having some domain knowledge) will still yield interesting insights: http://www.fda.gov/NewsEvents/MeetingsConferencesWorkshops/P...
There's also OIRA, which I haven't ever looked at but seems to have the same potential of finding underreported links if you have the patience to parse and text-mine it: https://www.whitehouse.gov/omb/oira_0910_meetings/
And of course, there's just the good ol FEC contributions database, which at least shows you individuals (and who they work for): https://github.com/datahoarder/fec_individual_donors
This is not to undermine what's described in the OP...but just to show how lucky you are if you're in the U.S. when it comes to dealing with official records. They don't contain everything perhaps but there's definitely enough (nevermind what you can obtain through FOIA by being the first person to ask for things) out there to explore influence and politics without as many technical hurdles.
Thanks; it's invaluable to hear from someone who has experience with the data.
Do you know what they are required to report? For example, if they have a 'social' dinner with a lobbyist, must that be reported? Are the requirements the same across the Executive Branch? All three branches?
I just ran across https://www.opensecrets.org/ and found it quite useful and comprehensive in tracking contributions to candidates.
I live in the US and am privileged by the level of transparency that exists, but it's still not necessarily enough. Similar issues are present with the clunky nature of government websites and databases, so I think we're in agreement that it's not even close to the potential of what it could be.
Thanks for sharing all the links and information!
This is a fascinating project. If successful, I suspect the result will be that lobbying no longer takes place in government offices ("shall we meet at that little place down the street?") or will be carried out over the phone.
For developers and managers out there, do you prefer to build your own in-house scrapers or use Scrapy or tools like Mozenda instead? What about import.io and kimono?
I'm asking because a lot of developers seem to be adamant against using web scraping tools they didn't develop themselves, which seems counterproductive, since you're taking on technical debt for an already solved problem.
So developers, what is the perfect web scraping tool you envision?
And it's always a fine balance between people who want to scrape Linkedin to spam people, others looking to do good with the data they scrape, and website owners who get aggressive and threatening when they realize they are getting scraped.
It seems like web scraping is a really shitty business to be in and nobody really wants to pay for it.
Full disclosure, I work for Scrapinghub. Our tools are Scrapy and Portia, both open source and both free as in beer. Scrapy is for those who want fine-tuned manual control and who have a background in Python. Portia is the visual web scraper for those who are non-technical to technical but don't want to bother with code.
Web scraping is everywhere, even if it's not necessarily spoken openly about or acknowledged. The publicized perception of web scraping is fairly negative, but doesn't take into account the benefits of data used in machine-learning or democratized data extraction (as in the case of this article or for building public service apps like transportation notifications), or the simple realities of competitive pricing and monitoring the activities of resellers.
Researchers, academics, data scientists, marketers, the list goes on for those who use web scraping daily.
Glad you enjoyed the article! I'm hoping that more examples of ethical data extraction will start to turn the tide of public perception.
(The data that I'm mining is published here: http://www.bcra.gov.ar/Estadisticas/estprv010000.asp)
In this case, some scripts using Beautiful Soup were enough to get the job done. I was completely unaware of Scrapy; it seems like a fantastic tool, and if I had known about it I probably would have used it.
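For a simple page, even the stdlib is enough. A rough equivalent of those Beautiful Soup scripts using only html.parser (the table layout and sample markup are assumptions, not the actual page structure):

```python
from html.parser import HTMLParser

# Collect the text of every table cell; assumes a plain <td>-based layout
# similar to what a Beautiful Soup script would target.
class CellExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

parser = CellExtractor()
parser.feed("<table><tr><td>ENTIDAD A</td><td>2011-03-02</td></tr></table>")
```

Beautiful Soup or Scrapy would make the same extraction shorter and far more robust against messy real-world HTML.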
If you need to build a solid web scraping stack which is going to be maintained by many people and is critical to your business, you have two options… to use Scrapy or to build something yourself.
Scrapy has been tried and tested over 6-7 years of community development, as well as being the base infrastructure for a number of >$1B businesses. Not only that, but there is a suite of tools which has been built around it – Portia for one, but also lots of other useful open source libraries: http://scrapinghub.com/opensource/
Right now most people still have to use XPath or CSS selectors to run their crawls and extract the data, but that won't be the case for much longer. There are more and more ways of skipping this step and getting at data automatically:
https://github.com/redapple/parslepy/wiki/Use-parslepy-with-...
https://speakerdeck.com/amontalenti/web-crawling-and-metadat...
https://github.com/scrapy/loginform
https://github.com/TeamHG-Memex/Formasaurus
https://github.com/scrapy/scrapely
https://github.com/scrapinghub/webpager
https://moz.com/devblog/benchmarking-python-content-extracti...
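For reference, the selector step in question looks like this. A toy sketch using the stdlib's limited XPath support (real Scrapy selectors run on lxml and handle messy HTML and much richer expressions; the markup here is invented):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page.
page = """<html><body>
  <div class="visit"><span>MARTIN BELAUNDE</span></div>
  <div class="visit"><span>OSCAR LOPEZ</span></div>
</body></html>"""

root = ET.fromstring(page)
# ElementTree supports a subset of XPath, including attribute predicates.
names = [span.text for span in root.findall(".//div[@class='visit']/span")]
```

Writing and maintaining expressions like that for every target site is exactly the hurdle the tools linked above try to remove.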
Scrapy (and lots of other Python tools, likely a majority of them created by people using it and BeautifulSoup) has lowered the cost of building web data harvesting systems to the point where one person can build crawlers for an entire industry in a couple of months.
It doesn't scale very well unless you have a lot of patience...but I've had immense success using the importxml() function in Google Sheets to compile raw election data while doing some freelance work for the Texas Libertarian Party a couple of years ago.
Outside of that, I often found myself building my own tools with a combination of Ruby, Nokogiri and Mechanize. Partly out of a desire to learn something new, and partly because many of my use cases didn't require anything more complex than "go to these pages, get the data within these elements and throw a CSV file over there".
After Kimono got shut down, I think a self-hosted open source version would be extremely popular. I want to build my own solution, but the API functionality and pagination / AJAX loaded data would be too difficult.
> balance between people who want to scrape Linkedin to spam people, others looking to do good with the data they scrape, and website owners who get aggressive and threatening when they realize they are getting scraped
Agreed. No one wants to be the bad guy and most clients looking to spam people are awful clients to have anyhow. Btw scraping LinkedIn is fairly difficult/expensive and they like to sue people.
We've banned this account for repeatedly violating the HN guidelines.
We're happy to unban accounts when people give us reason to believe they will post only civil and substantive comments in the future. You're welcome to email [email protected] if that's the case.
kilotaras | 10 years ago
It started as a volunteer project, and some projections put savings at around 10% of the total budget once it becomes mandatory in April.
[1] https://github.com/openprocurement/
ecthiender | 10 years ago
Probably world-changing, considering that even semi-technical folks can cook up tools to dig into things like this.
I know this tool was built by a developer, but Scrapinghub has a web UI for making scrapers.
jacquesm | 10 years ago
Did you mean omissions?
dkarp | 10 years ago
Web scraping is a really powerful tool for increasing transparency on the internet, especially given how transient online data is.
My own project, Transparent[1], has similar goals.
[1] https://www.transparentmetric.com/
logn | 10 years ago
It's a hard problem to generalize.