
BeGoneAds – A Python script that blocks ads by installing common hosts files

129 points | anned20 | 6 years ago | github.com

92 comments

[+] AdmiralAsshat|6 years ago|reply
Python scripts to modify system files make me a little skittish, even with source code available. I think I would just as soon grab the hosts file from https://www.someonewhocares.org/ and drop it in myself.
[+] digitalni|6 years ago|reply
That hosts file is depressing. Blocking trackers by a1.tracker.name, a2.tracker.name, etc. Today it seems easier to maintain a whitelist than a blacklist...
[+] Redoubts|6 years ago|reply
I also don't understand systemd integration in the todo.
[+] daverobbins1|6 years ago|reply
How is this different or better than Steven Black's project?

Repo: https://github.com/StevenBlack/hosts

[+] anned20|6 years ago|reply
To be completely fair with you, I didn't know this was out there. Then again, I think my solution is more elegant, especially once all the todos are finished.
[+] DyslexicAtheist|6 years ago|reply
My first question too. Steven Black's updateHostsFile.py is extensible, can be automated, is well trusted, and gets tested by a large community. I don't want to sound critical, but I would like to understand the value add in comparison. Both are in Python as well, so I don't get it.
[+] inlined|6 years ago|reply
Since accepting hosts files from someone on the internet can be dangerous, I dug into the code:

The list of hosts to exclude comes from several sites here: https://github.com/anned20/begoneads/blob/2c90fcee221edf71f8...

The actual application of the hosts file is here: https://github.com/anned20/begoneads/blob/2c90fcee221edf71f8...

I missed something though. Is a simple domain name per line enough to send that content to /dev/null? I haven’t used that form in /etc/hosts.

My primary concern was that this technique could be used to send ad traffic to a site that returns 404 but gathers metrics on the web regardless.

[+] js2|6 years ago|reply
I think you may be misreading the code. It concatenates the hosts files at the various URLs and then inserts their contents into /etc/hosts. I only looked at a couple of those files, but the ones I did look at used either 0.0.0.0 or 127.0.0.1 as the address, paired with a domain.

To be honest, this code is over-engineered. It could be a single script with a handful of functions. At the same time, it’s missing functionality such as deduplicating entries from the different lists.
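The missing deduplication step could look roughly like the sketch below. It is not taken from the BeGoneAds source; the function names, the set of local names to skip, and the sample lists are all assumptions made for illustration.

```python
from typing import Iterable, Iterator

# Addresses that the public block lists typically use for blocked entries.
BLOCK_ADDRESSES = {"0.0.0.0", "127.0.0.1"}
# Names that should keep their normal mapping (illustrative, not exhaustive).
LOCAL_NAMES = {"localhost", "localhost.localdomain", "local", "broadcasthost"}


def blocked_domains(hosts_text: str) -> Iterator[str]:
    """Yield the domains that a hosts-format text maps to a blocking address."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2 or fields[0] not in BLOCK_ADDRESSES:
            continue
        for name in fields[1:]:
            name = name.lower()
            if name not in LOCAL_NAMES:
                yield name


def merge_hosts(sources: Iterable[str], address: str = "0.0.0.0") -> str:
    """Combine several hosts files into one sorted, duplicate-free block."""
    unique = set()
    for text in sources:
        unique.update(blocked_domains(text))
    return "\n".join(f"{address} {name}" for name in sorted(unique)) + "\n"


# Hypothetical sample lists; real lists would be downloaded from the source URLs.
list_a = "127.0.0.1 localhost\n0.0.0.0 ads.example.com\n0.0.0.0 tracker.example.net # comment\n"
list_b = "127.0.0.1 ADS.example.com\n# just a comment\n"
merged = merge_hosts([list_a, list_b])
print(merged)
```

Normalizing every entry to a single address also sidesteps the mix of 0.0.0.0 and 127.0.0.1 across lists.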

[+] anned20|6 years ago|reply
The hosts are actually mapped to the IP address 0.0.0.0, which roughly means that all traffic to the listed hosts is routed back to localhost.
[+] sherincall|6 years ago|reply
I get that this is just someone's side project, I'm glad it exists and they're free to write it in their favorite language/environment and all; but the effort to actually run this is equivalent to actually copying the hosts files manually, and I already have all the dependencies installed. I could never get my non-techy parents to run this properly.

If the goal of the project is actual adoption, a native executable without external dependencies would have been a much better option.

[+] anned20|6 years ago|reply
This is a todo. It's already on PyPI, and I'm working on getting it packaged for the main Linux distros, Windows, and macOS.
[+] barbecue_sauce|6 years ago|reply
Anybody have a sense of the performance overhead of using hosts files versus a detached hardware solution like a pihole?
[+] NikolaNovak|6 years ago|reply
My understanding is that the difference is in scope, not performance.

Hosts files will only affect the host (workstation/desktop/laptop etc) they're installed on.

Things like Pi-hole try to make it easy to apply the solution to every device on your network. Even in a household these days that can mean dozens of devices, which makes it impractical to manage hosts files for all of them (including items like phones, where editing the hosts file is typically infeasible).

[+] dredmorbius|6 years ago|reply
Not much.

Setup: a 62,448-line /etc/hosts file (63,370 actual '0.0.0.0' entries), resolving 'www.google.com' 100 times, on Debian GNU/Linux, on a Thinkpad with spinning rust.

The short version has 32 lines, with 14 active entries, mostly defaults and local systems.

Short hosts:

    $ for i in {1..100}; do time host www.google.com; done 2>&1| grep real |  sed 's/^real[       ]*//; s/0m//; s/s$//' | mean
    n: 100, sum: 2.209, min: 0.015, max: 0.052, mean: 0.022090, median: 0.02, sd: 0.007450
    %-ile:  5: 0.016, 10: 0.016, 15: 0.016, 20: 0.016, 
    25: 0.0165, 30: 0.02, 35: 0.02, 40: 0.02, 45: 0.02, 
    55: 0.02, 60: 0.02, 65: 0.02, 70: 0.021, 75: 0.022, 
    80: 0.0245, 85: 0.029, 90: 0.033, 95: 0.0385

Big hosts:

    $ for i in {1..100}; do time host www.google.com; done 2>&1| grep real |  sed 's/^real[       ]*//; s/0m//; s/s$//' | mean
    n: 100, sum: 2.517, min: 0.016, max: 0.063, mean: 0.025170, median: 0.023, sd: 0.009818
    %-ile:  5: 0.016, 10: 0.016, 15: 0.016, 20: 0.016, 
    25: 0.017, 30: 0.0185, 35: 0.02, 40: 0.021, 45: 0.022, 
    55: 0.024, 60: 0.0255, 65: 0.0265, 70: 0.028, 75: 0.029, 
    80: 0.03, 85: 0.0325, 90: 0.0395, 95: 0.042

The delta of means is 0.003080s -- call it 3ms slower for the large hosts file.

("mean" is an awk script for computing univariate moments.)
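For readers without that awk script, a rough Python equivalent of the summary line is sketched below. The original script is not shown in the thread, so the function name, the choice of sample standard deviation, and the sample timings are assumptions; only the two means in the final calculation come from the output above.

```python
import statistics
from typing import Dict, List


def summarize(samples: List[float]) -> Dict[str, float]:
    """Return the basic univariate statistics for a list of timings."""
    ordered = sorted(samples)
    return {
        "n": len(ordered),
        "sum": sum(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        # Assumption: sample standard deviation; the awk script may differ.
        "sd": statistics.stdev(ordered),
    }


# Hypothetical timings in seconds, standing in for the parsed `time` output.
timings = [0.015, 0.020, 0.020, 0.025]
summary = summarize(timings)
print(summary)

# Difference between the two reported means, converted to milliseconds.
delta_ms = (0.025170 - 0.022090) * 1000
print(f"{delta_ms:.2f} ms")
```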

As others have mentioned, the main benefit of a centralised LAN service is that all devices on the LAN are protected. The hosts file on this system (a laptop) is effective regardless of where I am. It also pre-dates my configuring OpenWRT's adblock package about a month ago, though I'd had a hand-rolled DNSMasq configuration earlier. The laptop hosts file is almost certainly a few years out of date -- another occupational hazard of such things.

The OpenWRT solution runs on the Knot Resolver (kresd) caching nameserver. I've not noted any lag for it. The blocklist there is currently 231,627 hosts/domains (roughly doubled: specific + wildcard matches), from 0-29.com to zzzpooeaz-france.com.
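The "specific + wildcard" distinction matters because a hosts file has no wildcard syntax, so every subdomain needs its own line, while a resolver-level blocklist can match a whole domain suffix. The sketch below only illustrates suffix matching; it is not how kresd or the adblock package implements it, and the function name and sample domains are made up.

```python
from typing import Set


def is_blocked(hostname: str, blocked_suffixes: Set[str]) -> bool:
    """Return True if the hostname or any parent domain is on the blocklist."""
    labels = hostname.lower().rstrip(".").split(".")
    # Check "a1.tracker.example", then "tracker.example", then "example".
    return any(".".join(labels[i:]) in blocked_suffixes for i in range(len(labels)))


blocklist = {"tracker.example"}
print(is_blocked("a1.tracker.example", blocklist))
print(is_blocked("nottracker.example", blocklist))
```

One blocklist entry covers a1, a2 and any later subdomain, which a hosts file can only do by listing each name.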

[+] memco|6 years ago|reply
I used one of the popular hosts files on my local machine for a while. Networking didn’t seem to suffer, but my machine’s boot time slowed noticeably. Manual updates were also painful because loading the file in an editor is slow, so if you use your hosts file for other purposes it can get in the way of your workflow. I would recommend an automated process on a dedicated device so that normal usage isn’t affected.

Another experience I had was that certain sites failed to work correctly. I didn’t do extensive testing, but when I disabled the hosts blocking the sites worked, and when I enabled it they broke. These were companies with whom I was trying to do account-related business, so it wasn’t just that something didn’t render correctly: it actively prevented me from updating my accounts when I tried to submit requests.

I still like the approach and will continue to use it, but it hasn’t been frictionless.

[+] unethical_ban|6 years ago|reply
When the network goes down, it can take several minutes for it to come back up. I was having DNS and connectivity issues at a LAN party. Wouldn't get connectivity for minutes after a link bounce.

Then I removed the hosts file, and it worked instantly.

Maybe for a static workstation it wouldn't be bad, but for a laptop or something that loses link frequently, it could be an issue.

[+] LyndsySimon|6 years ago|reply
I don't have any evidence (I've not attempted to benchmark it or anything), but my gut says that the stack is checking the hosts file first anyhow, so it shouldn't be much. It might actually be an improvement over a separate appliance.
[+] gregw2|6 years ago|reply
I have cron jobs on my mac that update my hosts files (to block "addictive" sites in my case (not ads)). It doesn't really work.

Browsers cache lookups and use outside DNS servers despite the hosts file. Chrome and sometimes Safari don't honor the hosts file 100% of the time. Every once in a while I google around to try to regain control and tweak my browser settings, but I have yet to find anything that makes hosts files bulletproof.

[+] mywittyname|6 years ago|reply
I think firewall rules would be your next line of defense. I'm not sure how configurable most home routers are though.
[+] rafaelvasco|6 years ago|reply
Reading the code, one clearly sees why Python is so well suited to this kind of application, the one-shot script executable: really nice string ops, regex, file I/O, etc. It's one of my favorite languages. The other is C#, for everything Python is not that suitable for: huge, complex codebases, type safety, stricter performance requirements, and so on. Especially the static typing. The dynamism and lack of type annotations in Python really bothered me when I was developing a somewhat complex desktop app in it some years ago. I guess I'm a static-typing-with-optional-dynamism kind of person.
[+] misterdoubt|6 years ago|reply
If you haven't checked back lately, type annotations in Python are getting better and better. Built-in support via the typing module and a strong community package in mypy.
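As a small illustration of what that looks like in practice, here is an annotated function using the typing module. The function and sample data are hypothetical and not taken from the project.

```python
from typing import Dict, List


def count_by_address(lines: List[str]) -> Dict[str, int]:
    """Count hosts-file entries per address, ignoring comments and blank lines."""
    counts: Dict[str, int] = {}
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            counts[fields[0]] = counts.get(fields[0], 0) + 1
    return counts


sample = [
    "0.0.0.0 ads.example.com",
    "# comment",
    "127.0.0.1 localhost",
    "0.0.0.0 tracker.example.net",
]
counts = count_by_address(sample)
print(counts)
```

The interpreter does not enforce these annotations at runtime; a checker such as mypy should flag a call like count_by_address(42) before the script is ever run.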
[+] bigend|6 years ago|reply
If you let someone else manage the hosts your computer resolves, you are trusting that someone as much as your ISP. A man in the middle.
[+] ris|6 years ago|reply
> You ran WHAT script on your machine?!
[+] mehrdadn|6 years ago|reply
Hosts files slow down the system as well as the browser itself. Get/create a browser extension to actually block the request (at least while your browser supports this) so you get immediate results.
[+] firefoxd|6 years ago|reply
I wish the hosts file could have an include directive. Since I regularly add or remove entries, the file becomes a mess.
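One workaround is to keep entries in separate fragment files and regenerate the hosts file from them. The sketch below assumes a hypothetical directory of `*.hosts` fragments and only builds the combined text; the directory layout, file names, and marker comments are all made up, and writing the result to /etc/hosts is left out on purpose.

```python
import tempfile
from pathlib import Path


def build_hosts(fragment_dir: Path) -> str:
    """Concatenate all *.hosts fragments in name order, marking each section."""
    parts = []
    for fragment in sorted(fragment_dir.glob("*.hosts")):
        body = fragment.read_text(encoding="utf-8").strip()
        parts.append(
            f"# --- begin {fragment.name} ---\n{body}\n# --- end {fragment.name} ---"
        )
    return "\n\n".join(parts) + "\n"


# Demonstration in a temporary directory instead of a real system path.
with tempfile.TemporaryDirectory() as tmp:
    fragment_dir = Path(tmp)
    (fragment_dir / "00-base.hosts").write_text("127.0.0.1 localhost\n", encoding="utf-8")
    (fragment_dir / "50-work.hosts").write_text("10.0.0.5 build.internal.example\n", encoding="utf-8")
    combined = build_hosts(fragment_dir)

print(combined)
```

Adding or removing a group of entries then means adding or deleting one fragment and rebuilding, rather than editing one large file by hand.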
[+] zactato|6 years ago|reply
Serious question. We all realize that the economics of the internet are largely fueled by ads, so why are we so keen to block them? It’s ad revenue that has allowed technology to flourish so strongly over the last two decades.
[+] dsswh|6 years ago|reply
Not long ago, the economy was largely fueled by slavery. Yet we got rid of that.
[+] aagd|6 years ago|reply
I'd say it's not so much about the ads, more about the way they're delivered with all those shady tricks for tracking and fingerprinting. IMHO ad blocking is more about privacy than anything else...
[+] ycombonator|6 years ago|reply
Ads have gotten to a point where they hamper user experience. My problem with most of the ads is the amount of js tracking junk and 3rd party A/B calls that grinds the browser to a halt.
[+] jormungand|6 years ago|reply
You are partially right. There are ads that are truly useful and those which are malicious, such as popunders which often include scammy pops. I use adblock to prevent the latter. There are popunder ad networks which are trying to fight against adblock by introducing "solutions" like anti-adblock, see here: https://propellerads.com/blog/anti-adblock-3-monetize-99-per...

I side with the adblock solutions in this war.

[+] pinguinFromY|6 years ago|reply
I would not mind contextual ads that don't get in the way of viewing the actual content a website or mobile app is offering. But no, we get huge banners that cover the whole background, pop-ups, clicks that each open a new tab redirecting to an ad, and "hot chicks in your area". If I see an interesting ad I'll click on it myself; I don't need your help, really. Conclusion: it's not the ads themselves but how they get in your way. See reddit ads, google ads. They are part of the actual content.
[+] geggam|6 years ago|reply
Internet was a better place when the economics of the internet was driven by porn.

</2 cents>

[+] mtgx|6 years ago|reply
Whether or not ads on the internet are good is not that clear cut. But I would say 99% of tracking is both evil and mostly useless, probably with the 1% being first-party analytics to track some very simple stuff like page views.

I'm glad Firefox is now blocking third-party trackers by default (not that I needed it for myself, but it's important for others to have this).

[+] appleflaxen|6 years ago|reply
How does the list compare to the Pi-hole hosts file?
[+] hlau|6 years ago|reply
I'll be the first to admit that the existing advertising ecosystem is broken, primarily due to misaligned incentives across the board. But, given a choice, would you rather have a clearly labeled thing that you know is an ad transparently trying to influence you or a sneaky human billboard, err "influencer" coming up to you with an agenda along with tons of product placement in whatever you watch/read/listen to?
[+] harry8|6 years ago|reply
There's no either/or decision to be made here. You get compromised, paid-for content with or without ads as well. Critical thinking is always a requirement.