textmode's comments

textmode | 2 years ago | on: Remove HTTP headers from gzip or zip on stdin yy054 (revised)

Correction:

      /* remove HTTP headers from multiple gzip or single zip from stdin */
    
     int fileno (FILE *);
     int setenv (const char *, const char *, int);
     #define jmp (yy_start) = 1 + 2 *
     int x;
    %option nounput noinput noyywrap
    %%
    HTTP\/[\40-\176]+\x0d\x0a x++;
    [\40-\176]+:[\40-\176]+\r\n if(!x)fwrite(yytext,1,yyleng,yyout);
    \x0D\x0A if(!x)fwrite(yytext,1,yyleng,yyout);x=0;
    %%
    int main()
    { 
    yylex();
    exit(0);
    }

Usage example:

Retrieve hostnames, IP addresses and (if available) sitemap URLs from latest Common Crawl.

     ftp -4 https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/robotstxt.paths.gz # <-- 180K
     gzip -dc robotstxt.paths.gz \
     |head -5 \
     |sed 's>.*>GET /& HTTP/1.1[]Host: data.commoncrawl.org[]Connection: >;
           $!s/$/keep-alive[]/;$s/$/close[]/' \
     |tr '[]' '\r\n' \
     |openssl s_client -quiet -connect data.commoncrawl.org:443 \
     |yy054 \
     |zegrep -a '(^Sitemap:)|(^Host:)|(^WARC-Target-URI:)|(^WARC-IP-Address:)' > 1.txt
     exec cat 1.txt

textmode | 2 years ago | on: Remove HTTP headers from gzip or zip on stdin yy054 (revised)

Usage example:

Download NetBSD 1.0 in a single TCP connection.

    y="GET /pub/NetBSD-archive/NetBSD-1.0/source/src10/"
    z="Host: archive.netbsd.org"
    sed '$!s>.*>'"$y"'& HTTP/1.1[]'"$z"'[]Connection: keep-alive[]>;
         $s>.*>'"$y"'& HTTP/1.0[]'"$z"'[]>' << eof \
    |tr '[]' '\r\n' \
    |openssl s_client -quiet -connect 151.101.129.6:443 -servername archive.netbsd.org > http+gzip
    src10.aa
    src10.ab
    src10.ac
    src10.ad
    src10.ae
    src10.af
    src10.ag
    src10.ah
    src10.ai
    src10.aj
    src10.ak
    src10.al
    src10.am
    src10.an
    src10.ao
    src10.ap
    src10.aq
    src10.ar
    src10.as
    src10.at
    src10.au
    src10.av
    src10.aw
    src10.ax
    src10.ay
    src10.az
    src10.ba
    src10.bb
    src10.bc
    src10.bd
    src10.be
    src10.bf
    eof

    yy054 < http+gzip|tar tvzf /dev/stdin

Alternate usage:

Including any argv[1] will print the HTTP headers only:

    yy054 print < http+gzip
    yy054 x < http+gzip

textmode | 2 years ago | on: Extract URLs Relative and/or Absolute yy044

Normally I use yy030 but I have been experimenting with this instead.

Seems to be slightly faster and smaller than similar programs from html-xml-utils.

https://www.w3.org/Tools/HTML-XML-utils/man1/

Compile:

   links -no-connect -dump https://news.ycombinator.com/item?id=38727772 \
   |sed '1,4d;77,$d;s/[ ]\{6\}//' \
   |flex -8Cem;cc -O3 -std=c89 -W -Wall -pipe lex.yy.c -static -o yy044
   strip -s yy044

Example usage:

      # NB. not a real cookie
      curl -H "Cookie: user=santa&K7RGzmUtAoKv9OIRMfQ9bfwYpiDEuypp" -siA "" \
      https://news.ycombinator.com \
      |host=news.ycombinator.com/ yy044 r \
      |sed -n 's/&amp;/\&/g;/vote/p'

textmode | 3 years ago | on: Chunked-transfer decoding from stdin yy045

   /* chunked transfer decoding */
   
    #define echo do{if(fwrite(yytext,(size_t)yyleng,1,yyout)){}}while(0)
    #define jmp (yy_start) = 1 + 2 *
    int fileno (FILE *);
    int ischunked,chunksize,count;
   xa "\15"|"\12"
   xb "\15\12" 
   xc "HTTP/0.9"|"HTTP/1.0"|"HTTP/1.1"
   xd [Cc][Hh][Uu][Nn][Kk][Ee][Dd]
   xe [0-9a-fA-F]+\r\n
   xf [0-9a-fA-F]*\r\n
   %option noyywrap nounput noinput 
   %s xb xc xd xe xf
   %%
   ^{xc} echo;ischunked=0;jmp xc;
   <xc>^transfer-encoding: echo;jmp xb;
   <xb>\r\n\r\n echo;jmp xe;
   <xb>{xd} echo;ischunked=1;
   <xe>{xf}|{xe} {
   count=0;
   if(ischunked==1)
   {chunksize=strtol(yytext,NULL,16);
   jmp xd;};
   };
   <xd>{xb} jmp xf;
   <xd>. { 
   count++;
   if(count==chunksize)jmp xe;
   echo;
   };
   <xf>^[A-Fa-f0-9]+{xa}
   <xf>{xa}+[A-Fa-f0-9]+{xa}
   <xf>{xb}[A-Fa-f0-9]+{xb}
   %%
   int main(){ yylex();exit(0);}

textmode | 4 years ago | on: A call to minimize distraction and respect users’ attention (2013)

Below is a short script that downloads and makes a PDF from the image files. No browser required.

The script uses a feature of HTTP/1.1 called pipelining; proponents of HTTP/2 and HTTP/3 want people to believe it has problems because it does not fit their commercialised web business model. As the script below demonstrates, it has no problems. It is a feature that simply does not suit the online-ad-industry-funded business model, with its gigantic corporate browsers, bloated conglomerate web pages and incessant data collection. Here, only 2 TCP connections are used to retrieve 141 images. Most servers are less restrictive and allow more than 100 requests per TCP connection. Pipelining works great. Much more efficient than browsers, which open hundreds of connections. IMHO.

    (export Connection=keep-alive
    x1=http://www.minimizedistraction.com/img/vrg_google_doc_final_vrs03-
    x2(){ seq -f "$x1%g.jpg" $1 $2;};
    x3(){ yy025|nc -vvn 173.236.175.199 80;};
    x2   1 100|x3;
    x2 101 200|x3;
    )|exec yy056|exec od -An -tx1 -vw99999|exec tr -d '\40'|exec sed 's/ffd9ffd8/ffd9\
    ffd8/g'|exec sed -n /ffd8/p|exec split -l1;
    for x in x??;do xxd -p -r < $x > $x.jpg;rm $x;done;
    convert x??.jpg 1.pdf 2>/dev/null;rm x??.jpg

    ls -l ./1.pdf

More details on yy025 and yy056 here: https://news.ycombinator.com/item?id=27769701

textmode | 4 years ago | on: Althttpd: Simple webserver in a single C file

https://news.ycombinator.com/item?id=27490265 <-- yy054

The "gibberish" is GZIP compressed data. "yy054" is a simple filter I wrote to extract a GZIP file from stdin, i.e., discard leading and trailing garbage. As far as I can tell, the compressed file "ee.txt" is not chunked transfer encoded. If it were chunked, we would first extract the GZIP, then decompress and finally process the chunks (e.g., filter out the chunk sizes with the filter submitted in the OP).

In this case all we need to do is extract the GZIP file "ee.txt" from stdin, then decompress it:

    printf "GET /ee.txt\r\nHost: stuff-storage.sfo3.digitaloceanspaces.com\r\nConnection: close\r\n\r\n"|openssl s_client -connect 138.68.34.161:443 -quiet|yy054|gzip -dc > 1.htm
    firefox ./1.htm
   
Hope this helps. Apologies, I initially guessed wrong on the here doc. I was not sure what was meant by "gibberish". Looks like the here doc is working fine.

textmode | 4 years ago | on: Althttpd: Simple webserver in a single C file

Need to get rid of the leading spaces on all lines except the "int fileno" line. Can also forgo the "here doc" and just save the lines between "flex" and "eof" to a file. Run flex on that file. This will create lex.yy.c. Then compile lex.yy.c.

The compiled program is only useful for filtering chunked transfer encoding on stdin. Most "HTTP clients" like wget or curl already take care of processing chunked transfer encoding. It is when working with something like netcat that chunked transfer encoding becomes "DIY". This is a simple program that attempts to solve that problem. It could be written by hand without using flex.

textmode | 4 years ago | on: Althttpd: Simple webserver in a single C file

The extra "a" is a typo but would have no effect. The "i" is also superfluous but harmless. Without more details on the "gibberish" it is difficult to guess what happened. The space before "int fileno (FILE *);" is required. All the other lines must be left-justified, no leading spaces, except the line with "int main()" which can be indented if desired.

textmode | 4 years ago | on: Althttpd: Simple webserver in a single C file

I make most HTTP requests using netcat or similar tcp clients so I write filters that read from stdin. Reading text files with the chunk sizes in hex interspersed is generally easy. Sometimes I do not even bother to remove the chunk sizes. Where it becomes an issue is when it breaks URLs. Here is a simple chunked transfer decoder that reads from stdin and removes the chunk sizes.

   flex -8iCrfa <<eof
    int fileno (FILE *);
   xa "\15"|"\12"
   xb "\15\12" 
   %option noyywrap nounput noinput 
   %%
   ^[A-Fa-f0-9]+{xa}
   {xa}+[A-Fa-f0-9]+{xa}
   {xb}[A-Fa-f0-9]+{xb} 
   %%
   int main(){ yylex();exit(0);}
   eof

   cc -std=c89 -Wall -pipe lex.yy.c -static -o yy045

Example

Yahoo! serves chunked pages

   printf 'GET / HTTP/1.1\r\nHost: us.yahoo.com\r\nConnection: close\r\n\r\n'|openssl s_client -connect us.yahoo.com:443 -ign_eof|./yy045

textmode | 4 years ago | on: Binary to hex faster than xxd, part 2 of 2

    #include <unistd.h>
    #include <errno.h>
    /* part 1 provides writeall(); buf/buflen are not shown in
       either part, so assume: */
    static char buf[4096];
    static long long buflen;
    int writeall(int, const void *, long long);
    static void flush(void) {
      if (writeall(1, buf, buflen) == -1) _exit(errno);
      buflen = 0;
    }
    static void wrch(const char ch) {
      if (buflen >= (long long) sizeof buf) flush();
      buf[buflen++] = ch;
    }
    char inbuf[128];
    int main(void) {
        long long r, i;
        for (;;) {
            r = read(0, inbuf, sizeof inbuf);
            if (r == -1) _exit(errno);
            if (r == 0) break;
            for (i = 0; i < r; ++i) {
                wrch("0123456789abcdef"[15 & (inbuf[i] >> 4)]);
                wrch("0123456789abcdef"[15 & inbuf[i]]);
            }
        }
        wrch('\n');
        flush();   /* write out anything still buffered */
        return 0;
    }

textmode | 4 years ago | on: Binary to hex faster than xxd, part 1 of 2

    #include <unistd.h>
    #include <errno.h>
    #include <sys/types.h>
    int writeall(int fd,const void *xv,long long xlen)
    {
      const unsigned char *x = xv;
      long long w;
      while (xlen > 0) {
        w = xlen;
        if (w > 1048576) w = 1048576;
        w = write(fd,x,w);
        if (w == -1) return -1;   /* propagate write errors */
        x += w;
        xlen -= w;
      }
      return 0;
    }
    static int hexdigit(char x)
    {
      if (x >= '0' && x <= '9') return x - '0';
      if (x >= 'a' && x <= 'f') return 10 + (x - 'a');
      if (x >= 'A' && x <= 'F') return 10 + (x - 'A');
      return -1;
    }
    int hexparse(unsigned char *y,long long len,const char *x)
    {
      if (!x) return 0;
      while (len > 0) {
        int digit0;
        int digit1;
        digit0 = hexdigit(x[0]); if (digit0 == -1) return 0;
        digit1 = hexdigit(x[1]); if (digit1 == -1) return 0;
        *y++ = digit1 + 16 * digit0;
        --len;
        x += 2;
      }
      if (x[0]) return 0;
      return 1;
    }

textmode | 5 years ago | on: Chrome is deploying HTTP/3 and IETF QUIC

Glad others are starting to articulate this issue. HTTP/3 is derived from HTTP/2. Google's main argument for HTTP/2's existence, its selling point to users, is head-of-line blocking in HTTP/1.1 pipelining. They also complain about the size of repeated HTTP headers.

But no modern browsers actually use HTTP/1.1 pipelining. Interestingly, HTTP/1.1 pipelining works great for non-browser use. Most web servers enable it by default. After all, it works. For example, requesting a series of pages from a multi-page website, all over a single TCP connection. I have been using HTTP/1.1 pipelining this way for decades. It is fast and reliable and enables the web to be used as a non-interactive information retrieval source. It is also 100% ad-free. The user only gets what she requests, nothing more.

As for HTTP headers, privacy-conscious or minimalist users might not send many headers, only the minimum to retrieve the page. That's usually up to three extra lines of text per page for the request headers. (I rarely ever have to send a User-Agent header for HTTP/1.1 pipelining.)

   GET /index.html HTTP/1.1
   Host: example.com
   Connection: keep-alive
Obviously, the web advertising/tracking industry, including companies like Google that serve this sector, uses headers for its own purposes, and that is presumably where headers grow large. As a user, however, I have no pressing need for the ability to send/receive larger headers.

Websites (IPs represented by domain names) to which users intentionally connect, i.e., the recognisable names that they type and click on, generally don't serve ads. The ads come from other domains, often other servers. Users generally do not intentionally try to connect to ad or tracking servers. HTTP/[01].x's automatic loading of resources, Javascript and other techniques may be used to make those requests, conveniently under the radar and outside the user's awareness.

Still, under HTTP/1.1, neither ads nor the Javascript files that trigger requests for ads can generally be delivered without the user's computer making a request first. Users can and do manage to exercise some control over their computers, and they can prevent these non-interactive requests from being sent, from inside and outside the browser.

With HTTP/2 and HTTP/3, the necessity of a user-generated request disappears. As soon as the user "connects" (UDP) to the website's server, the server can, for example, send a Javascript file to the user's browser, which can in turn trigger requests to other domains for ads or tracking, all without any preceding request for that Javascript file. This HTTP/[23] feature is called "server push"; interestingly, it is not the feature used to sell HTTP/[23] to users (that would be the replacement for pipelining).

So, how does a user stop unwanted ads being "pushed" upon her in the stream (irrespective of the application, e.g., browser)? I generally don't use a "modern" browser, nor Javascript nor graphics. I like my pipelining outside the browser and free of advertising-related cruft.

It's worth considering that the motivation for speeding up websites via HTTP/[23] is to speed the delivery of more ads, more "stealthily", to users. This is a classic case of someone selling a "solution" to a problem they themselves created (or to which they are contributing).

It is like an ISP upselling customers to faster internet so that websites bogged down with ads will "load" faster, when the ISP itself injects ads into those same pages.

textmode | 5 years ago | on: A safer and more private browsing experience with Secure DNS

"Instead of fetching a specific site, you fetch blocks of sites."

I have been doing this for many years, putting bulk DNS data in HOSTS and personal use zone files served from loopback addresses. It is easier than ever today with so many sources of bulk DNS data.

DOH now lets users retrieve DNS data from recursive DNS servers (caches) in bulk, using HTTP/1.1 pipelining. Here is a working example: https://news.ycombinator.com/item?id=23242389

Many years ago, I started doing non-recursive (no caches used) bulk DNS data retrieval for speed and also for resiliency in the event of outages. However the privacy gains are obvious. A rough analogy is downloading all of Wikipedia in bulk and browsing articles offline as opposed to making separate requests online for each article and generating all the requisite DNS and TCP/HTTP traffic. Openmoko's Wikireader experimented with the idea of offline Wikipedia.

Not only does the DOH provider get a record of all the user's DNS lookups, she can now associate each request with the particular user program/device that made it.

textmode | 6 years ago | on: Stop Infinite Scrolling

   Corrections:
   /tcs/s//openssl s_client -ign_eof -connect/;s/.com/&:443/
   s/1|2/titles|urls/
   s/;;1/;;titles/
   s/;;2/;;urls/

textmode | 6 years ago | on: Stop Infinite Scrolling

One of the websites where one can find this annoying "infinite scroll" is YouTube channels.

I wrote a quick and dirty script to address this annoyance.

It can be used to output a table of all the video urls and video titles for any YouTube channel.

"yy032" and "yy025" are some utilities I wrote to decode html and transform urls to HTTP for HTTP/1.1 pipelining, respectively.1 Instead of using yy025 and openssl, one could alternatively make a separate TCP connection for each HTTP request, e.g., using something like curl. Personally, I prefer not to make lots of connections when retrieving multiple pages from the same domain.

Here is a hypothetical example of how to use the script, "1.sh", to make a table of all the video urls and video titles in a channel.

   echo https://www.youtube.com/user/example/videos|sh 1.sh|yy025|openssl s_client -connect www.youtube.com:443 > 2.html
   sh 1.sh urls < 2.html > example.1
   sh 1.sh titles < 2.html > example.2
   rm 2.html
   paste -d '\t' example.1 example.2

   # 1.sh
   case $1 in
   "")
   exec 2>/dev/null
   export Connection=close
   yy025|tcs www.youtube.com |sed 's/%25/%/g'|yy032 > 1.html
   while true;do
   x=$(sed 's/%25/%/g;s/\\//g' 1.html|yy032|grep -o "[^\"]*browse_ajax[^\"\\]*"|sed 's/u0026amp;/\&/g;s/&direct_render=1//;s,^,https://www.youtube.com,');
   echo > 1.html;
   test ${#x} -gt 100||break
   echo "$x" 
   echo "$x"|yy025|openssl s_client -connect www.youtube.com:443 -ign_eof > 1.html
   done;
   rm 1.html;
   ;;-h|-?|-help|--help) echo usage: echo https://www.youtube.com/user/example/videos \|$0 ;echo usage: $0 "[1|2]" \< 2.htm
   ;;1) sed 's/\\//g;s/u0026amp;//g;s/u0026quot;//g;s/u0026#39;//g'|grep -o "ltr\" title=\"[^\"]*"|sed 's/ltr..title=.//'  
   ;;2) sed 's/\\//g;s/u0026amp;//g;s/u0026quot;//g'|grep -o "[^\"]*watch?v=[^\"]*"|sed 's,^,https://www.youtube.com,'|uniq
   esac
1 https://news.ycombinator.com/item?id=17689165 https://news.ycombinator.com/item?id=17689152