These look like diffs to me, which I'm okay with. Not sure where the ML comes into play, and honestly, just calling it a site that shows "updates" or "changes" to 2020 candidate sites is good enough.
Thanks... the bulk of the site is made up of diffs. We score them w/ our ML model. We aren't (yet) surfacing those scores on a page-by-page basis. Rather, we're simply collecting the stand-out examples detected by the ML in the top card, as links to those specific diffs.
Our bad for not including the scores for each page... our thinking w/ this content piece was that it would be attractive to journalists, who may not have an appreciation for the underlying scores.
It's not clear from the site where you're using ML - the URLs seem chronological, not ranked by "relevance/importance", and I can't see any relevance/importance indicators.
I'm curious to hear some more detail on how you're encoding things for your visual diff model.
BTW I ran into 2 issues:
1. You can't zoom out on your image diff slider
2. I got this error when I returned to the site after closing it:
{"crossDomain": true, "method": "GET", "url": "https://api.fluxguard.com/public/site/7f646558-f754-447a-b627-9b5202c8a1f2/page?limit=10&publicAccount=campaignmonitor"} Please contact us if this error is happening frequently for you.
Thanks for the feedback... right now we are only using extracted text (and some DOM data) for ML. We aren't using the images for any ML work because, as you likely suspect, that's pretty hard to do in a meaningful way.
We aren't surfacing the per-page scores at the moment... most of the ML work was done specifically for this content piece, so we haven't adjusted our core presentation to include it (including the flag-grading system we use to build the training set), other than simply listing stand-out examples at the top of the site.
For our customers, we likely will need to build industry/use-specific models. (Amazon is already sort of doing this by providing pharm-specific text classifiers.) Use cases are so disparate at the moment that it's hard to build a general model for everyone.
As for the cross-domain errors, grrr! Thanks, we'll look into it... those are hard to troubleshoot. Our API stack consists of CloudFormation -> API Gateway -> API Gateway Caching -> Lambda -> DynamoDB... (edit: AKA who knows where those errors are! haha)
(And more detail on the problem you're having w/ the image diff slider would be appreciated... feel free to email us directly via the address at the bottom of the site. Not sure I understand the issue you're hitting.)
(edit: feel free to email me w/ error details at peter (at) deepdiveduck . com ... the cross-domain scripting errors are a constant thorn in our side, so we'd like any other info you have on 'em)
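If anyone else hits errors like that one: in an API Gateway -> Lambda stack, a frequent culprit is a response path (often an error or cached response) that omits CORS headers. Here's a minimal sketch of the guard, assuming a Lambda proxy integration - this is illustrative, not our actual handler:

```python
import json

# Attach CORS headers on every return path, success or failure.
# If any response (especially a 5xx or a cached one) ships without
# Access-Control-Allow-Origin, the browser reports a cross-origin
# failure instead of the real error.
CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",   # or the specific site origin
    "Access-Control-Allow-Methods": "GET,OPTIONS",
}

def handler(event, context):
    try:
        pages = {"pages": []}  # stand-in for the real DynamoDB query result
        return {"statusCode": 200, "headers": CORS_HEADERS,
                "body": json.dumps(pages)}
    except Exception:
        # Error responses need the headers too.
        return {"statusCode": 500, "headers": CORS_HEADERS,
                "body": json.dumps({"error": "internal"})}

resp = handler({}, None)
print(resp["statusCode"], resp["headers"]["Access-Control-Allow-Origin"])
```

With a caching layer in the middle it's also worth checking whether the cache ever stores a response that's missing these headers - every subsequent hit would then look cross-origin-broken to the client.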
Hi folks: So we're monitoring most major 2020 Presidential Candidates' sites for visual + HTML/DOM + network + extracted text changes. (You can see all detected changes at the above link.) There's a lot of noise! So we're using ML to identify significant changes. (You can see these findings so far at the top of the page.)
We've trained our model using detected changes from corporate sites and some earlier political sites. Each change for our model was human-rated in terms of relevance/importance, and we also feed in other descriptive attributes about each change, such as DOM location, immediate parent tag, and several other attributes.
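To make the "descriptive attributes" concrete, here's a rough sketch of the featurization step for one detected change. The field names and features are illustrative, not our actual schema:

```python
from dataclasses import dataclass

# Hypothetical shape of one detected change; fields are illustrative.
@dataclass
class Change:
    dom_path: str        # e.g. "html/body/main/p" - DOM location
    parent_tag: str      # immediate parent tag of the changed node
    added_chars: int     # characters of text added
    removed_chars: int   # characters of text removed

def featurize(c: Change) -> list[float]:
    """Turn one change into a numeric vector for the relevance model."""
    depth = c.dom_path.count("/")                # DOM location as tree depth
    in_content = 1.0 if c.parent_tag in {"p", "h1", "h2", "article"} else 0.0
    churn = c.added_chars + c.removed_chars      # total text churn
    return [float(depth), in_content, float(churn)]

# Human ratings of relevance/importance would supply the labels; here we
# just show the feature step on two toy changes.
edits = [
    Change("html/body/main/p", "p", added_chars=120, removed_chars=30),
    Change("html/body/footer/span", "span", added_chars=2, removed_chars=2),
]
vectors = [featurize(c) for c in edits]
print(vectors)  # the content-area edit shows far more churn than the footer one
```

Vectors like these, paired with the human ratings, are what a standard classifier would train on.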
What sort of noise are you running into? (Curious)
In an old project we actually did something a bit more based on visual changes. Basically we detected visual diffs and got the coordinates of the changed region. We then found the smallest HTML container that encompassed the diff and highlighted it.
Using visual diff you can fuzz things a bit to handle artifacts and small movements.
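A minimal sketch of that approach, with made-up pixel grids standing in for real screenshots (which would come from a rendering/diffing lib):

```python
# Compare two grayscale frames pixel-by-pixel with a fuzz tolerance
# (to absorb anti-aliasing artifacts and tiny shifts), then return the
# bounding box of the changed region - the box you'd map back to the
# smallest enclosing HTML container.

def diff_bbox(before, after, tolerance=10):
    """Return (min_x, min_y, max_x, max_y) of changed pixels, or None."""
    changed = [
        (x, y)
        for y, (row_a, row_b) in enumerate(zip(before, after))
        for x, (a, b) in enumerate(zip(row_a, row_b))
        if abs(a - b) > tolerance  # fuzz: small value shifts are ignored
    ]
    if not changed:
        return None
    xs, ys = zip(*changed)
    return (min(xs), min(ys), max(xs), max(ys))

before = [[0, 0, 0, 0]] * 4
after  = [[0, 0, 0, 0],
          [0, 200, 200, 0],
          [0, 200, 5, 0],   # value 5 is within tolerance -> not "changed"
          [0, 0, 0, 0]]
print(diff_bbox(before, after))  # (1, 1, 2, 2)
```

Mapping the box back to the DOM would then be a matter of finding the smallest element whose layout rectangle contains it.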
To compute the text differences, are you using the HTML/DOM and the rendered webpage to extract the text? Or is the method for each diff determined separately? I'm curious what the input and output of the ML model consist of.
A good lib we have some miles on: https://github.com/mapbox/pixelmatch
And this write-up on niffy is pretty good:
https://segment.com/blog/perceptual-diffing-with-niffy/