bluepeter's comments

bluepeter | 2 hours ago | on: The Banality of Surveillance

I'm in my 50s, and when I was in my early 20s I crossed from the US to Canada for a business meeting. "Why are you coming to Canada?" "To work." "Where's your work permit?" "Huh, I don't have one." That simple "wrong word" slip STILL gets me flagged and cordoned off into hours-long border diversions whenever I go to Canada.

Just imagine how it'll be now... for decades you'll be fending off hidden receipts from some IG comment you made.

bluepeter | 3 days ago | on: New York could prohibit chatbot medical, legal, engineering advice

This is not directionally good. NY already has laws against unauthorized professional practice and deceptive conduct, and S7263 mainly replaces regulator-led enforcement with a vague, fee-shifting private cause of action, one likely to drive serial plaintiff litigation while chilling useful consumer guidance.

bluepeter | 3 days ago | on: New York could prohibit chatbot medical, legal, engineering advice

What's at least somewhat humorous is the disclaimer requirement that "[t]he text of the notice shall [be] no smaller than the largest font size of other text appearing on the website on which the chatbot is utilized."

H1 hero font size, here we come, for disclaimers! (Which, per the bill, don't do anything anyway.) There's also the fanciful assumption that chatbots only appear on websites.

bluepeter | 2 years ago | on: Software firms across US facing tax bills that threaten survival

Section 174 intro: "In general in the case of a taxpayer’s specified research or experimental expenditures for any taxable year".

Note the limiting phrase.

Section 174 later: "For purposes of this section, the term “specified research or experimental expenditures” means, with respect to any taxable year, research or experimental expenditures which are paid or incurred by the taxpayer during such taxable year in connection with the taxpayer’s trade or business."

Note the "in connection with the taxpayer's trade or business" and look up the definition of that phrasing versus "carrying on" business and compare to Section 162. (e.g., Snow v. Commissioner of Internal Revenue, 416 U.S. 500 (1974), Cantor v. Commissioner of Internal Revenue, 998 F.2d 1514 (1993), Scoggins v. Commissioner of Internal Revenue, 46 F.3d 950 (1995))

Section 174 later: "For purposes of this section, any amount paid or incurred in connection with the development of any software shall be treated as a research or experimental expenditure."

Note the limiting phrase.

Ultimately we need guidance from the Service but the above are (possibly aggressive) readings some CPAs are taking.

bluepeter | 6 years ago | on: Show HN: Using ML to detect key changes to 2020 Candidates' Websites

Yeah, I am not sure it is a CORS error, though that wouldn't surprise me. This looks to be an error from our API Gateway validation rules (which, at least for us, are notoriously difficult to coax into sending any more error data to the client)... that is to say, this will typically occur when one of the form or query string params is sending illegal data. I tried to repro last night w/ no luck. This sort of error isn't in the error logs (that we typically monitor)...

bluepeter | 6 years ago | on: Show HN: Using ML to detect key changes to 2020 Candidates' Websites

In terms of overall noise, we're running into a lot. (This is what led us to ML as a way to hopefully reduce it.)

Comparing the pure DOM reveals almost constant change due to various inserted JavaScript from Google/Facebook/etc. modifying the DOM. Looking at text/images also results in a fair amount of expected noise, but it's mostly in the form of interstitial marketing banners, fundraising targets, etc.

We already have various options to "filter" out certain DOM areas. (As an example, remove all footers, headers, or any other CSS selector.) These work really well... but they require a fair bit of setup.
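To give a flavor of the filtering idea, here's a minimal stdlib-only sketch (not our actual pipeline: the real filters accept arbitrary CSS selectors, which this toy version reduces to skipping whole tag regions like header/footer before the text diff):

```python
from html.parser import HTMLParser

class RegionStripper(HTMLParser):
    """Extract visible text, skipping everything inside configured container tags."""

    def __init__(self, skip_tags=("header", "footer", "script", "style")):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self._depth = 0  # nesting depth inside a skipped region (0 = keep text)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Enter a skipped region, or go one level deeper inside one.
        if tag in self.skip_tags or self._depth:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if not self._depth and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    """Text content of the page with skipped regions removed."""
    parser = RegionStripper()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Diffing `visible_text` of two crawls then ignores boilerplate churn in headers/footers by construction. (Caveat: this toy tracker assumes well-closed tags inside skipped regions.)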

Thanks for the pixelmatch GH repo link... I have it starred so must have taken a look in the past. We need to evolve our image diff, so we may end up using this!
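For anyone curious, the core of a pixelmatch-style comparison boils down to counting per-pixel channel differences above a threshold. A toy Python sketch (pixelmatch itself adds anti-aliasing detection and perceptual color distance, which this omits):

```python
def pixel_diff_count(img_a, img_b, threshold=0.1):
    """Count pixels whose RGB channels differ beyond a normalized threshold.

    img_a / img_b: equal-length lists of (r, g, b) tuples, values 0-255.
    threshold: fraction of the 0-255 channel range treated as "no change".
    """
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same dimensions")
    limit = threshold * 255
    mismatched = 0
    for (r1, g1, b1), (r2, g2, b2) in zip(img_a, img_b):
        # Flag the pixel if any channel moved more than the threshold allows.
        if max(abs(r1 - r2), abs(g1 - g2), abs(b1 - b2)) > limit:
            mismatched += 1
    return mismatched
```

The returned mismatch count (relative to total pixels) is the kind of score you'd alert on.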

bluepeter | 6 years ago | on: Show HN: Using ML to detect key changes to 2020 Candidates' Websites

Thanks... the bulk of the site is made up of diffs. We score them w/ our ML model. We aren't (yet) surfacing those scores on a page-by-page basis. Rather, we are simply collecting the stand-out examples detected by ML in the top card, as links to those specific diffs.

Our bad for not including the scores for each page... our thinking w/ this content piece was that it would be attractive to journalists, who may not have an appreciation for the underlying scores.

bluepeter | 6 years ago | on: Show HN: Using ML to detect key changes to 2020 Candidates' Websites

Each diff is done slightly differently depending on what it is... we use headless Chrome + Puppeteer on Fargate for crawling. We use Puppeteer itself to take screenshots and output HTML. We then use a separate Lambda function to extract the text.

From that, we feed the results into "diff" Lambda functions to compute image, text, HTML, and network diffs. We treat the text diff as the primary diff type, and so only if we have a text diff do we do the other comparison types.
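The gating logic is roughly this (a hypothetical single-process sketch using Python's difflib; the real thing runs as separate Lambdas, and the non-text diffs here are placeholders):

```python
import difflib

def text_diff(old_text, new_text):
    """Return a unified text diff, or None when nothing changed."""
    lines = list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm=""))
    return "\n".join(lines) or None

def run_diffs(old, new):
    """Treat the text diff as primary; only run the other diff types
    when the text actually changed.

    `old` / `new`: dicts of crawl artifacts, e.g. {"text": ..., "html": ...}.
    """
    results = {"text": text_diff(old["text"], new["text"])}
    if results["text"] is None:
        # No text change: skip the image/HTML/network comparisons entirely.
        return results
    # Placeholder for the other "diff" functions (image, HTML, network).
    results["html"] = old.get("html") != new.get("html")
    return results
```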

From these diffs, we then feed the text + some DOM info into ML. There, we use added text, deleted text, and, for each, the shortest unique CSS selector, the immediate parent tag, and some other items that I may be missing (possibly some approximation of where in the main text the change appears... e.g., top 10%, top 20% IIRC).
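In sketch form, the per-change record fed to the model looks something like this (field names are hypothetical, and the real pipeline also derives the shortest unique CSS selector from the DOM rather than taking it as an input):

```python
def change_features(added_text, deleted_text, css_selector, parent_tag,
                    char_offset, doc_length):
    """Build one feature record for a detected change.

    char_offset / doc_length give the change's position in the extracted
    main text, bucketed into deciles (0 = top 10% of the page).
    """
    decile = min(int(10 * char_offset / max(doc_length, 1)), 9)
    return {
        "added_len": len(added_text),
        "deleted_len": len(deleted_text),
        "selector": css_selector,      # shortest unique CSS selector
        "parent_tag": parent_tag,      # immediate parent tag of the change
        "position_decile": decile,
    }
```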

Hopefully this answers your question?

bluepeter | 6 years ago | on: Show HN: Using ML to detect key changes to 2020 Candidates' Websites

Thanks for the feedback... right now we are only using extracted text (and some DOM data) for ML. We aren't using the images for any ML work because, as you likely suspect, that's pretty hard to do in a meaningful way.

We aren't surfacing the per page scores at the moment... most of the ML work was done specifically for this content piece, so we haven't adjusted our core presentation to include it (including the flag grading system which we use to build the training model), other than simply listing stand-out examples at the top of the site.

For our customers, we likely will need to build industry/use-specific models. (Amazon is already sort of doing this by providing pharm-specific text classifiers.) Use cases are so disparate at the moment that it's hard to build a general model for everyone.

As for the cross-domain errors, grrr! Thanks, we will look into it... hard to troubleshoot those. Our API stack consists of CloudFormation -> API Gateway -> API Gateway Caching -> Lambda -> DynamoDB... (edit AKA who knows where those errors are! haha)

(And more detail on the problem you are having w/ image diff slider would be appreciated... feel free to email us directly from email at bottom of site. Not sure I understand this issue you're having.)

(edit feel free to email me w/ error details at peter (at) deepdiveduck . com ... the cross-domain scripting errors are a constant thorn in our sides so we'd like to get any other info you have on 'em)

bluepeter | 6 years ago | on: Show HN: Using ML to detect key changes to 2020 Candidates' Websites

Hi folks: So we're monitoring most major 2020 Presidential Candidates' sites for visual + HTML/DOM + network + extracted text changes. (You can see all detected changes at the above link.) There's a lot of noise! So we're using ML to identify significant changes. (You can see these findings so far at the top of the page.)

We've trained our model using detected changes from corporate sites and some earlier political sites. Each change for our model was human-rated in terms of relevance/importance, and we also feed in other descriptive attributes about each change, such as DOM location, immediate parent tag, and several other attributes.

bluepeter | 6 years ago | on: Show HN: "Network diff” detects new scripts or data exfiltration on websites

Let me know if all y'all have any questions! Fluxguard provides web change monitoring and alerts. We take screenshots, fully render the DOM... and we alert you to any DOM, pixel, or text changes.

Our new "network diff" feature goes one step further.

It creates a HAR file for all network activity on any page (including for complex form submission pages such as shopping carts). We repeatedly crawl this page (or sequences of pages). And we look for changes to network activity.

This way, you can catch and be alerted to any new XHR, image, script, or other resource activity on any page of your site.

You can use whitelists to exclude certain domains from analysis (e.g., google.com). Lots of other config options let you further reduce false positives.
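At its core this is a set comparison over request URLs from the two crawls' HAR files, minus the allowlisted domains. A simplified Python sketch assuming the standard HAR 1.2 JSON shape (the real product compares more than just URLs):

```python
from urllib.parse import urlparse

def network_diff(old_har, new_har, allowlist=("google.com",)):
    """Return request URLs seen in the new crawl but not the old one,
    skipping allowlisted domains (and their subdomains).

    HARs are plain dicts per the HAR 1.2 shape:
    {"log": {"entries": [{"request": {"url": ...}}, ...]}}
    """
    def urls(har):
        out = set()
        for entry in har["log"]["entries"]:
            url = entry["request"]["url"]
            host = urlparse(url).hostname or ""
            # Skip allowlisted domains, e.g. google.com and www.google.com.
            if not any(host == d or host.endswith("." + d) for d in allowlist):
                out.add(url)
        return out
    return sorted(urls(new_har) - urls(old_har))
```

Anything this returns (say, a script suddenly loading from an unfamiliar host on your checkout page) is exactly the kind of change worth an alert.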

Why would you want to do this?

Magecart and other hacking groups use cross-site scripting, poisoned NPM modules, DNS spoofing, and many other attacks to exfiltrate data from Magento and other CMSes.

It's hard to stop these guys as they are adept at covering their tracks. Edge protection systems aren't great if the attacker is coming "from inside the house."

Our new network diff crawls your live site repeatedly. We orchestrate common user journeys -- creating an account, ordering a product -- and we look for any network activity that shouldn't be there.

Cool, eh?

(Sorry for the wall of text.)

bluepeter | 6 years ago | on: Google Search is routinely gamed by private blog networks

As a legit biz trying to win legit links, this is such a hassle to deal with.

Perhaps even worse is the rise of "barely legal" blogs (though these may be the same thing as PBNs, I suppose). By these I mean blogs that are, sure, original content, original "reviews," and "legitimate."

But the articles are pumped out en masse, often written in sub-par English, with nothing more than a re-wording of a reviewed site's "about us" page. (Perhaps they're even training Markov models on reviewed sites' content?) And they increasingly dominate searches, particularly in the B2B and B2C tech space.

Do these serve customers' search intent? They're simple gateway pages: the content is often not really "readable." But since Google favors "reviews" and pages that link to many other top-10 SERPs, they dominate over legit product pages.

Not far behind on the list of deplorables is the rise of the "tech stack" lists: endless lists of "alternatives to X," "rankings of XYZ products" (with next to no legit reviews), or "here's the stack this company uses." All designed to win widespread long-tail links.

bluepeter | 6 years ago | on: Mysterious illness that paralyzes healthy kids prompts plea from CDC

I've been following this closely. There's no real data on the following:

- Is this happening in other countries? I can't really find any established diagnoses outside of the US (though some countries have CDC-like pages on it).

- What is the gender breakdown?

- What is the age breakdown? (mean, median, SD)

- What is the geo breakdown? (city vs rural, major geo areas)
