This is great work. Forgive me if I'm missing it, but since the blog post implies you're aggregating and cleaning the data from several lists, is there any way to see the latest additions (RSS, etc.) rather than directly searching for individuals?
It would make it more useful for flagging up potential stories, as well as researching stories journalists are already writing.
disclosure: I work for a company that provides real-time data to journalists for story discovery, and I know we'd certainly be interested
Carlos, really great work, congratulations!!
I've been studying topics related to technology vs. corruption for a while here at Berkeley. I have interesting testimonies from contacts who have lived through the post-technology shift in government. Peru has a lot of potential in this area. If you need help at any point, I'd be happy to support you!
Good work! It would be interesting to cross-match the visits with other sources of information (newspapers, WikiLeaks, etc.) over a timeline to recreate someone's whole history. This would make it possible to identify patterns and their modus operandi.
It would be interesting to see the volume of visits by government office year over year. I have a feeling that periods around elections might look very different. Also would be interested to see distribution color-coded by industry. Mining and contracting should pop up for certain time periods and government agencies.
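Once the visit log is tabular, that kind of aggregation is a one-liner; a minimal sketch (the column layout and the sample rows here are made up for illustration):

```python
from collections import Counter

# Hypothetical rows: (visitor, government_office, visit_date as YYYY-MM-DD)
rows = [
    ("Ana", "Ministry of Energy and Mines", "2011-03-02"),
    ("Luis", "Ministry of Energy and Mines", "2011-07-15"),
    ("Ana", "PCM", "2012-01-09"),
]

# Count visits per (office, year) to compare volumes year over year.
by_office_year = Counter((office, date[:4]) for _, office, date in rows)
```

The same grouping key could be swapped for (industry, year) once visitors are tagged by industry, which would surface the mining and contracting spikes mentioned above.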
Is there a source of information like Peru's for Bolivia? I imagine there is a lot to discover about corruption and influence peddling in Bolivia.
Full disclosure, I work for Scrapinghub and the web UI you speak of is Portia - our open source visual web scraper. It's for those who range from non-technical to technical but want a quick way to scrape data. I think it's extremely important to develop tools to democratize the acquisition of data regardless of technical background and skill. Glad you find the article and tool interesting!
Can you draw a co-visit graph of people, i.e. who visited the building at the same times as somebody else? The strength of a connection could be visited_both^2 / ((visited_without_other_1 + 1) * (visited_without_other_2 + 1)).
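The suggested strength metric is easy to sketch; here is a toy version in Python (the (person, timeslot) log format and the sample data are assumptions for illustration):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: visit log as (person, timeslot) pairs.
visits = [
    ("alice", "2015-06-01 10:00"), ("bob", "2015-06-01 10:00"),
    ("alice", "2015-06-02 09:00"), ("bob", "2015-06-02 09:00"),
    ("carol", "2015-06-03 14:00"),
]

# Group visitors by timeslot, and count each person's total visits.
by_slot = defaultdict(set)
total = defaultdict(int)
for person, slot in visits:
    by_slot[slot].add(person)
    total[person] += 1

# Count co-visits for every pair seen in the same timeslot.
both = defaultdict(int)
for people in by_slot.values():
    for a, b in combinations(sorted(people), 2):
        both[(a, b)] += 1

# strength = visited_both^2 / ((visits_without_other_1 + 1) * (visits_without_other_2 + 1))
def strength(a, b):
    vb = both.get(tuple(sorted((a, b))), 0)
    return vb ** 2 / ((total[a] - vb + 1) * (total[b] - vb + 1))
```

With the sample data, alice and bob always visit together, so their edge dominates, while alice–carol scores zero.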
FWIW, if you live in the U.S., then you benefit from having such data in great quantity, though I don't think it's sliced-and-diced to near the potential that it has:
Lobbyists have to follow registration procedures, and their official interactions and contributions are posted to an official database that can be downloaded as bulk XML:
http://www.senate.gov/legislative/lobbyingdisc.htm#lobbyingd...
Could they lie? Sure, but in the basic analysis that I've done, they generally don't feel the need to...or rather, things that I would have thought lobbyists/causes would hide, they don't. Perhaps the consequences of getting caught (e.g. in an investigation that discovers a coverup) far outweigh the annoyance of filing the proper paperwork...having it recorded in an XML database that few people take the time to parse is probably enough obscurity for most situations.
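Parsing the bulk XML is mostly a matter of walking the tree; a sketch with Python's stdlib (the element and attribute names below are invented for illustration — check the actual schema in the Senate download):

```python
import xml.etree.ElementTree as ET

# Hypothetical structure; the real filings use their own schema.
doc = """<Filings>
  <Filing Year="2014">
    <Registrant Name="Acme Lobbying LLC"/>
    <Client Name="Widget Corp"/>
  </Filing>
</Filings>"""

root = ET.fromstring(doc)
# Pull out (registrant, client) pairs from each filing.
pairs = [
    (f.find("Registrant").get("Name"), f.find("Client").get("Name"))
    for f in root.iter("Filing")
]
```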
There's also the White House visitor database, which does have some outright admissions, but still contains valuable information if you know how to filter the columns:
https://www.whitehouse.gov/briefing-room/disclosures/visitor...
But it's also a case (as it is with most data) where having some political knowledge is almost as important as being good at data-wrangling. For example, it's trivial to discover that Rahm Emanuel had few visitors despite his key role, so you'd have to be able to notice that and then take the extra step to find out his workaround:
http://www.nytimes.com/2010/06/25/us/politics/25caribou.html
And then there are the many bespoke systems and logs you can find if you do a little research. The FDA, for example, has a calendar of FDA officials' contacts with outside people...again, it might not contain everything but it's difficult enough to parse that being able to mine it (and having some domain knowledge) will still yield interesting insights: http://www.fda.gov/NewsEvents/MeetingsConferencesWorkshops/P...
There's also OIRA, which I haven't ever looked at but seems to have the same potential of finding underreported links if you have the patience to parse and text-mine it: https://www.whitehouse.gov/omb/oira_0910_meetings/
And of course, there's just the good ol FEC contributions database, which at least shows you individuals (and who they work for): https://github.com/datahoarder/fec_individual_donors
This is not to undermine what's described in the OP...but just to show how lucky you are if you're in the U.S. when it comes to dealing with official records. They don't contain everything perhaps but there's definitely enough (nevermind what you can obtain through FOIA by being the first person to ask for things) out there to explore influence and politics without as many technical hurdles.
Thanks; it's invaluable to hear from someone who has experience with the data.
Do you know what they are required to report? For example, if they have a 'social' dinner with a lobbyist, must that be reported? Are the requirements the same across the Executive Branch? All three branches?
I just ran across https://www.opensecrets.org/ and found it quite useful and comprehensive in tracking contributions to candidates.
I live in the US and am privileged by the level of transparency that exists, but it's still not necessarily enough. Similar issues are present with the clunky nature of government websites and databases, so I think we're in agreement that it's not even close to the potential of what it could be.
Thanks for sharing all the links and information!
This is a fascinating project. If successful, I suspect the result will be that lobbying no longer takes place in government offices ("shall we meet at that little place down the street?") or will be carried out over the phone.
For developers and managers out there, do you prefer to build your own in-house scrapers or use Scrapy or tools like Mozenda instead? What about import.io and kimono?
I'm asking because a lot of developers seem to be adamant against using web scraping tools they didn't develop themselves, which seems counterproductive, since you're taking on technical debt for an already solved problem.
So developers, what is the perfect web scraping tool you envision?
And it's always a fine balance between people who want to scrape Linkedin to spam people, others looking to do good with the data they scrape, and website owners who get aggressive and threatening when they realize they are getting scraped.
It seems like web scraping is a really shitty business to be in and nobody really wants to pay for it.
Full disclosure, I work for Scrapinghub. Our tools are Scrapy and Portia, both open source and both free as in beer. Scrapy is for those who want fine-tuned manual control and who have a background in Python. Portia is the visual web scraper for those who are non-technical to technical but don't want to bother with code.
Web scraping is everywhere, even if it's not necessarily spoken openly about or acknowledged. The publicized perception of web scraping is fairly negative, but doesn't take into account the benefits of data used in machine-learning or democratized data extraction (as in the case of this article or for building public service apps like transportation notifications), or the simple realities of competitive pricing and monitoring the activities of resellers.
Researchers, academics, data scientists, marketers, the list goes on for those who use web scraping daily.
Glad you enjoyed the article! I'm hoping that more examples of ethical data extraction will start to turn the tide of public perception.
(The data that I'm mining is published here: http://www.bcra.gov.ar/Estadisticas/estprv010000.asp)
In this case, some scripts using Beautiful Soup were enough to get the job done. I was completely unaware of Scrapy; it seems like a fantastic tool, and if I had known about it I probably would have used it.
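For a simple page, even the stdlib is enough. A rough equivalent of those Beautiful Soup scripts using only html.parser (the table layout and sample markup are assumptions, not the actual page structure):

```python
from html.parser import HTMLParser

# Collect the text of every table cell; assumes a plain <td>-based layout
# similar to what a Beautiful Soup script would target.
class CellExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

parser = CellExtractor()
parser.feed("<table><tr><td>ENTIDAD A</td><td>2011-03-02</td></tr></table>")
```

Beautiful Soup or Scrapy would make the same extraction shorter and far more robust against messy real-world HTML.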
If you need to build a solid web scraping stack which is going to be maintained by many people and is critical to your business, you have two options… to use Scrapy or to build something yourself.
Scrapy has been tried and tested over 6-7 years of community development, as well as being the base infrastructure for a number of >$1B businesses. Not only that, but there is a suite of tools which has been built around it – Portia for one, but also lots of other useful open source libraries: http://scrapinghub.com/opensource/
Right now most people still have to use XPath or CSS selectors to run their crawls and extract the data, but that won't be the case for much longer. There are more and more ways of skipping this step and getting at data automatically:
https://github.com/redapple/parslepy/wiki/Use-parslepy-with-...
https://speakerdeck.com/amontalenti/web-crawling-and-metadat...
https://github.com/scrapy/loginform
https://github.com/TeamHG-Memex/Formasaurus
https://github.com/scrapy/scrapely
https://github.com/scrapinghub/webpager
https://moz.com/devblog/benchmarking-python-content-extracti...
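For reference, the selector step in question looks like this. A toy sketch using the stdlib's limited XPath support (real Scrapy selectors run on lxml and handle messy HTML and much richer expressions; the markup here is invented):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page.
page = """<html><body>
  <div class="visit"><span>MARTIN BELAUNDE</span></div>
  <div class="visit"><span>OSCAR LOPEZ</span></div>
</body></html>"""

root = ET.fromstring(page)
# ElementTree supports a subset of XPath, including attribute predicates.
names = [span.text for span in root.findall(".//div[@class='visit']/span")]
```

Writing and maintaining expressions like that for every target site is exactly the hurdle the tools linked above try to remove.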
Scrapy (and lots of other Python tools, likely a majority of them created by people using it and BeautifulSoup) has lowered the cost of building web data harvesting systems to the point where one person can build crawlers for an entire industry in a couple of months.
It doesn't scale very well unless you have a lot of patience...but I've had immense success using the importxml() function in Google Sheets to compile raw election data while doing some freelance work for the Texas Libertarian Party a couple of years ago.
Outside of that, I often found myself building my own tools with a combination of Ruby, Nokogiri and Mechanize. Partly out of a desire to learn something new, and partly because many of my use cases didn't require anything more complex than "go to these pages, get the data within these elements and throw a CSV file over there".
After Kimono got shut down, I think a self-hosted open source version would be extremely popular. I want to build my own solution, but the API functionality and pagination / AJAX loaded data would be too difficult.
> balance between people who want to scrape Linkedin to spam people, others looking to do good with the data they scrape, and website owners who get aggressive and threatening when they realize they are getting scraped
Agreed. No one wants to be the bad guy and most clients looking to spam people are awful clients to have anyhow. Btw scraping LinkedIn is fairly difficult/expensive and they like to sue people.
We've banned this account for repeatedly violating the HN guidelines.
We're happy to unban accounts when people give us reason to believe they will post only civil and substantive comments in the future. You're welcome to email [email protected] if that's the case.
kilotaras | 10 years ago
It started as a volunteer project, and some projections put savings at around 10% of the total budget once it becomes mandatory in April.
[1] https://github.com/openprocurement/
ecthiender | 10 years ago
Probably world-changing, considering that even semi-technical folks can cook up tools to dig into things like this.
I know this tool was built by a developer, but Scrapinghub has a web UI for making scrapers.
jacquesm | 10 years ago
Did you mean omissions?
dkarp | 10 years ago
Web scraping is a really powerful tool for increasing transparency on the internet, especially given how transient online data is.
My own project, Transparent[1], has similar goals.
[1] https://www.transparentmetric.com/
logn | 10 years ago
It's a hard problem to generalize.