top | item 7203277

vwinsyee | 12 years ago

Contrary to the current HN title, the article points out:

Evidence presented during Private Manning’s court-martial for his role as the source for large archives of military and diplomatic files given to WikiLeaks revealed that he had used a program called “wget” to download the batches of files. That program automates the retrieval of large numbers of files, but it is considered less powerful than the tool Mr. Snowden used.

So the tool wasn't wget. curl, perhaps?

3pt14159|12 years ago

Having done this type of work before for a legitimate purpose, it is almost certainly a Python or Perl script with a nice library in front of it that makes it easy to follow links.

wget is too brittle, not extensible enough, and not as maintainable as a nice Python script.
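The script-plus-library approach described above can be sketched with nothing but Python's standard library. This is a minimal illustration, not anyone's actual tooling; the function names are my own, and `fetch` is injected so the crawl logic stays separate from the network code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML document."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(fetch, start_url, limit=100):
    """Breadth-first crawl: fetch(url) -> html, following links up to limit pages."""
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:
                queue.append(link)
    return pages
```

In real use `fetch` would wrap `urllib.request.urlopen` (plus error handling, politeness delays, and a URL filter); keeping it as a parameter makes the crawler trivial to test against a fake site.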

jrochkind1|12 years ago

I believe Manning actually used Windows batch scripting to automate wget, or so the government alleged from forensics at the trial. (I observed a couple of days of the trial.)

Manning did not have the tech skills of Snowden, though; she wasn't necessarily doing things in the most effective or elegant ways, but it worked.

wslh|12 years ago

Wget is also single-threaded, which is a slow strategy for downloading pages.
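For comparison, fanning downloads out across threads takes only a few lines in Python. This is a generic sketch, not a claim about any tool involved here; `fetch` is a stand-in for whatever per-URL download routine you use:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, urls, workers=8):
    """Run fetch(url) across a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Because downloads are I/O-bound, threads overlap the waiting, so wall-clock time is roughly the slowest batch rather than the sum of all requests.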

sebastianavina|12 years ago

The other day I had the task of batch-downloading product pictures from a website. Every picture had a session ID in the URI, so I couldn't do a simple image wget. I wrote a simple Python script that generated a shell script with a lot of "wget -E -H -k -p \n sleep 30", and ran it through a cloud server for a couple of days. After that, some simple scripts for renaming the pictures, some regular expressions here and there, and voilà: 250k perfectly named pictures for my product catalog. (It's for an intranet, so I guess I won't have copyright problems.)
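A generator for that kind of shell script is short. This is a hypothetical reconstruction of the approach described above (the function name is mine; the wget flags and 30-second sleep come from the comment):

```python
def make_wget_script(urls, delay=30):
    """Emit a shell script that wgets each URL, pausing between requests."""
    lines = ["#!/bin/sh"]
    for url in urls:
        # -E adjust extension, -H span hosts, -k convert links, -p page requisites
        lines.append("wget -E -H -k -p '%s'" % url)
        lines.append("sleep %d" % delay)
    return "\n".join(lines) + "\n"
```

Quoting each URL matters here, since session IDs often contain `&` and `?`, which the shell would otherwise interpret.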

eli|12 years ago

FYI, you have exactly the same copyright issues on an intranet. You're just less likely to get caught, I guess.

kudu|12 years ago

curl is just a library with a slim command-line interface. It can't scrape pages by itself. Perhaps you're thinking of curlmirror? Even then, I doubt it can be considered more powerful than a good wget configuration.

kurtsiegfried|12 years ago

Nutch/Solr could provide a way to do a crawl, refine parameters, and then feed into a tool to download the actual resources.