textmode's comments

textmode | 6 years ago | on: The IA Client – The Swiss Army Knife of Internet Archive

You are probably thinking of pipelining in terms of the popular web browsers. Those programs want to do pipelining so they can load up resources (read: today, ads) from a variety of domains in order to present a web page with graphics and advertising.

That never really worked. Thus, we have HTTP/2, authored by an ad sales company. It is very important for an ad sales company that web pages contain not only what the user is requesting but also heaps of automatically followed pointers to third party resources hosted on other domains. That is, pages need to be able to contain advertising. HTTP/1.1 pipelining is of little benefit to the ad ecosystem.

However, sometimes the user is not trying to load up a graphical web page full of third party resources. Here, the HN commenter is just trying to get some HTML, extract some URLs and then download some files. The HTML is all obtained from the same domain. This is text retrieval, nothing more.

If all the resources the user wants are from the same domain, e.g., archive.org, then pipelining works great. I have been using HTTP/1.1 pipelining to do this for several decades and it has always worked flawlessly.

Typically httpd settings for any website would allow at least 100 pipelined requests per connection. As you might imagine, often the httpd settings are just unchanged defaults. Today the limits I see are often much higher, e.g., several hundred.

It is very rare in my experience to find a site that has pipelining disabled. More likely they are disabling Connection: keep-alive and forcing all requests to be Connection: close. I rarely see this.

The HTTP/1.1 specification suggests a max connection limit per browser of two. There is no suggested limit on the number of requests per connection. In terms of efficiency, the more the better. How many connections does a popular we browser make when loading an "average" web page today? It is a lot more than two! In any event, pipelining as I have shown here stays under the two connection limit.

textmode | 6 years ago | on: Making single-purpose utilities example: filter URLs from input

Yeah, I am pretty good with sed. Probably better than you. I have sed versions of all these programs.

I am not using any computers that cannot run flex and cc.

Results usually go to yy025, a program that makes http from urls.

textmode | 6 years ago | on: Making single-purpose utilities example: filter URLs from input

Of course I use grep -o too. This is not a "correct" filter. It is not a perfect regexp for 100% of urls.

However for something as simple and essential (for the author) as filtering urls I do not want to always have to worry about potential differences in shells, different versions of grep or the absence of a grep as I use different computers, different OS or OS versions. I find this more predictable and portable.

Neither customization nor extensibility are goals. For that a scripting language is better suited.

textmode | 6 years ago | on: The IA Client – The Swiss Army Knife of Internet Archive

# If are not a Python user or want to try something different (faster), can be done with sh, sed, openssl, curl/wget/etc. plus a simple utility I wrote called "yy025" (https://news.ycombinator.com/item?id=17689152). yy025 is a more generalised "Swiss Army Knife" for making requests to any website. This solution uses a traditional method called "http pipelining".

   export Connection=keep-alive;
   n=1;while true;do
   test $n -le 8||break
   echo https://archive.org/details/computerchronicles?\&sort=-downloads\&page=$n
   n=$((n+1));done \
   |yy025|openssl s_client -connect archive.org:443 -ign_eof \
   |sed '/item-ia\" [^ ]/!d;s,.*=\",,;s/\"//;s,.*,https://archive.org/download/&/&_archive.torrent,' \
   |yy025|openssl s_client -connect archive.org:443 -ign_eof|sed '/Location:/!d;s/Location: //'

# Additional command-line options for openssl s_client omitted for sake of brevity. The above outputs the torrent urls. Feed those to curl or wget or whatever similar program you choose, or maybe directly to a torrent client. Something like

   |while read url;do curl -O $url;done

textmode | 7 years ago | on: Intel Publishes Microcode Patches, No Benchmarking or Comparison Allowed

For discussion...

Question: Are these license restrictions on right to disclose benchmarks enforceable?

Question: If they are enforceable, do licensors ever try to enforce them? If not, why?

A little background here: https://danluu.com/anon-benchmark/

For example, this has been posted to HN at least twice recently:

https://clemenswinter.com/2018/07/09/how-to-analyze-billions...

Question: Was the author subject to any restrictions on publication? If yes, did the author seek "permission" from the licensor to publish these findings?

Excerpts from some of the licenses:

2.2. 32 Bit Kdb+ Software Use Restrictions

(c) 32 Bit Kdb+ Software Evaluations. User shall not distribute or otherwise make available to any third party any report regarding the performance of the 32 Bit Kdb+ Software, 32 Bit Kdb+ Software benchmarks or any information from such a report unless User receives the express prior written consent of Kx to disseminate such report or information.

kdb+ on demand - Personal Edition [64-bit]

1.3 Kdb+ On Demand Software Performance. End User shall not distribute or otherwise make available to any third party any report regarding the performance of the Kdb+ On Demand Software, Kdb+ On Demand Software benchmarks or any information from such a report unless End User receives the express, prior written consent of Kx to disseminate such report or information.

This Kdb+ Software Academic Use License Agreement ("Agreement") is made between Kx Systems, Inc. ("Kx") and you, the University, or employee of the University ("End User") with respect to Kx's 64 bit Kdb+ Software and any related documentation that is made available to you in (jointly, the "Kdb+ Software"). You agree to use the Kdb+ Software under the terms and conditions set forth below. This Agreement is effective upon you clicking the "I agree" button below.

1. LISCENSE GRANTS [sic]

1.4 Kdb+ Software Evaluations. End User shall not distribute or otherwise make available to any third party any report regarding the performance of the Kdb+ Software, Kdb+ Software benchmarks or any information from such a report unless End User receives the express, prior written consent of Kx to disseminate such report or information.

Kdb+ software end-user agreement:

By accessing the Kdb+ Software via the Google platform, you are agreeing to be bound by these terms and conditions (which may be updated from time to time) and to the extent you are acting on behalf of a permitted organization that you have authority to act on their behalf and bind them to these terms and conditions.

You may not access the Kdb+ Software if you are a direct competitor of Kx.

4. Benchmark Test Results. User agrees not to disclose benchmark, test or performance information regarding the Kdb+ Software to any third party except as explicitly authorized by Kx in writing.

textmode | 7 years ago | on: Dear Ad Networks

Current Intel license:

"Unless expressly permitted under the Agreement, You will not, and will not allow any third party to ... (v) publish or provide any Software benchmark or comparison test results."

If the author uses the ad servers to conduct benchmark or comparison tests of Intel software, and the ad networks allow the author to provide or publish the results of those tests, then it could be argued the ad networks are violating their license agreement with Intel. As a preemptive measure to prevent such testing, perhaps the ad networks would block the author's IP address.

The language in the Intel license probably was inspired from similar language first used by Oracle. This language is commonly copied and pasted into many software license agreements.

http://www.eweek.com/c/a/Application-Development/DB-Test-Pio...

Question: Is this type of restriction enforceable?

The only way to answer this is for end-users to challenge it. There was a case where a state attorney general challenged it because it was used in a deceptive way. The AG won. However the AG was not challenging this restriction as an end-user. The Court appeared to suggest the restriction would be unenforceable, but was not asked to decide that question. The question was whether the state's consumers were being mislead. Excerpts of that case below.

   Excerpts from  http://www.leagle.com/decision/2003579195Misc2d384_1519.xml

   195 Misc.2d 384 (2003)

   758 N.Y.S.2d 466

   Supreme Court, New York County.

   January 6, 2003.

     ----------------------------------------------------------------------

   Supreme Court, New York County.

    OPINION OF THE COURT

   MARILYN SHAFER, J.

Network Associates included on the face of many of its software diskettes and on its download page on the Internet the following restrictive clause:

"Installing this software constitutes acceptance of the terms and conditions of the license agreement in the box. Please read the license agreement before installation. Other rules and regulations of installing the software are: "a. The product can not be rented, loaned, or leased-you are the sole owner of this product. "b. The customer shall not disclose the result of any benchmark test to any third party without Network Associates' prior written approval. "c. The customer will not publish reviews of this product without prior consent from Network Associates, Inc." (Affirmation of Kenneth M. Dreifach, exhibit 2.)

In July 1999, Network World Fusion, an online magazine, published a comparative review of six firewall software products, including Network Associates' Gauntlet. It appears that Network World Fusion sought permission to publish the review of Gauntlet and that Network Associates denied it. Network World Fusion performed the review despite Network Associates' refusal to allow the review of Gauntlet. In response to the unsatisfactory results of the review, Network Associates communicated its protest, quoting the language of the restrictive clause.

This conduct prompted an investigation by the office of the Attorney General of the State of New York.

"This language implies that limitations on the publication of reviews do not reflect the policy of Network Associates, but result from some binding law or other rules and regulations imposed by an entity other than Network Associates."

Assume for the sake of discussion, there is some such entity.

That is, assume some entity (e.g., Oracle, Microsoft, Intel, etc.) has a license restriction prohibiting publication of benchmark results.

Does the Court think that restriction would be enforceable?

"Thus, the Attorney General has made a showing that the language at issue may be deceptive, and as such, the language is not merely unenforceable, but warrants an injunction and the imposition of civil sanctions according to Executive Law S: 63 (12) and General Business Law S: 349."

Is the Court here suggesting that even if the restriction was not deceptive, it is nevertheless unenforceable.

Is it possible to read that sentence as suggesting the restriction has the qualities of being both unenforceable and deceptive.

As to unenforceability, no users challenged the enforceability of the restriction. Until they do, we cannot answer the question of enforceability.

However, as to deceptiveness, this can be a violation of state business laws and give rise to grounds for injunction and civil sanctions. This is what allowed the NY AG to take action on behalf of NY state consumers.

AG won. NA lost.

The Court granted a permanent injunction prohibiting NA from ever including the following notice with its software:

"Installing this software constitutes acceptance of the terms and conditions of the license agreement in the box. Please read the license agreement before installation. Other rules and regulations of installing the software are: "a. The product can not be rented, loaned, or leased-you are the sole owner of this product. "b. The customer shall not disclose the result of any benchmark test to any third party without Network Associates' prior written approval. "c. The customer will not publish reviews of this product without prior consent from Network Associates, Inc.";

The injunction also prohibits NA from "including any language restricting the right to publish the results of testing and review without notifying the Attorney General at least 30 days prior to such inclusion". NA was directed "to provide a sworn certified statement indicating the number of instances in which software was sold on discs or through the Internet containing the above-mentioned language in order for the court to determine what, if any, penalties and costs should be ordered."

textmode | 7 years ago | on: Edge Computing at Chick-fil-A

"Raspi cannot be used for a semi serious server application due to microsd card write wear issues."

As an end user, I run a personal authoritative DNS server that has small RAM requirements. The RPi (or other SBC) boots to a mfs mounted root, then mounts all directories as tmpfs. Then I remove the SD card.1 As such, the logs for this server, which are automatically rotated and do not exceed 5M in total, are written to RAM.

1 I only use the SD card to boot. The only files on the card are a bootloader, a bootloader config and two kernels, each with an embedded filesystem. If updates are necessary, I make them to one of the kernels at a time. The other is the backup. The bootloader and bootloader config lets me specify which kernel to boot.

textmode | 7 years ago | on: Shamir's Secret Sharing

http://web.archive.org/web/20031002050746/http://atrey.karli...

textmode | 7 years ago | on: Ask HN: Simple, beginner friendly ETL / Data Engineering project ideas?

What are the industries that have the highest costs for ETL/Data Engineering?

What companies in these industries are interested in reducing their costs for this work?

"costs" as used above includes time expenditures as well as spending money

http://web.archive.org/web/19991023120316/http://www.dbmsmag...

textmode | 7 years ago | on: Plan to replicate 50 high-impact cancer papers shrinks to just 18

There is an unwritten rule in biology that if you publish a paper that refers to uses of certain reagents that are not commercially available, then you are obligated to provide those reagents to other investigators who read the paper and request them. There can also be an expected obligation that the other researchers will share any data they generate using the reagents with the original authors.

Outside of biology, I have seen many "academic" papers published on computer-related topics that refer to software programs developed by the papers' authors that are crucial to the research but not publicly available. Is there any similar unwritten rule to that in biology where another researcher reading these papers can request a copy of these programs from the authors?

Obviously, in many cases other researchers cannot replicate and verify findings without access to the same research tools used in the published papers.

textmode | 7 years ago | on: Freeing the Web from the Browser

"Different people have different perspectives on how information should be connected, so why do we not allow these range of perspectives to be represented and shared digitally? Why limit ourselves to just one point of view?

...

Why re-create code editors, simulators, spreadsheets, and more in the browser when we already have native programs much better suited to these tasks?"

The title is something I contemplated and began to address long ago, only on a personal level.

With respect to the first question, perhaps this goes to the poor mechanism promoted by Google, to rank the www's contents by "popularity".

This mechanism obviously succeeds for purposes of measuring www user opinion and selling advertising (the later not anticipated by the founders in the early years). However it falls short in the non-commercial context, e.g., the academic setting out of which the company grew. Anyone remember "Knol"?

Today Google search (and probably others seeking to emulate its commercial success) intentionally promote a pattern of usage of their cache/database where its users never reach "page 2" of search results. The company has built their ad sales business on the idea that one perspective ("the top search result") should not only prevail but also that, optimally, other results need not even be considered. It should be obvious that in a non-commercial research context, this is not optimal.

If the www is 100% commercial then of course this is not an issue. But "the www" is difficult to define. All httpd's on any accessible network? All httpd's listening on accessible addresses with corresponding ICANN-registered domainnames? All pages crawled by a commercial bot, deposited in a commercial www cache and made accessible to the public? And so on. In any event, if users only view the www's supposed contents through the lense of a commercial entity, the perception of what the www actually comprises may be manipulated in a way that suits commercial interests, e.g. the sale of advertising.

As to the second question, when given the choice I do not use a popular web browser. The author mentions the utility of "native programs". I would prefer the term "dedicated programs". Programs that perform essentially one task, or "do one thing". Whether such programs can perform their dedicated tasks better than an omnibus-styled program that performs many, varied tasks is a question for the user to decide. For example, the author answers that native programs are "better suited" than the web browser.

The "web browser" has become a conglomeration of once dedicated programs.

There are such dedicated programs for making TCP connections over which HTTP commands can be sent and www content retrieved. This is a task that web browsers can perform, although some users may prefer a dedicated program. In this way content retrieval can be separated from content consumption, alleviating many of the www annoyances such as user tracking, manipulation and advertising.

textmode | 7 years ago | on: Internet Archive, decentralized

Is there a delay between the time the robots.txt changes and the time when the content becomes inaccessible via Wayback Machine? How often does the archive.org_bot crawl robots.txt?

Can a script check robots.txt periodically for changes and if changes are detected, then download the content from Wayback Machine before it becomes inaccessible?

Additionally, can a script check the domain registration for an anticipated expiration date, or perhaps monitor domainname "drop lists"?

textmode | 7 years ago | on: The default OpenSSH key encryption is worse than plaintext

I have been using tinysshd for a number of years and I am hooked. Keen to experiment, I have also been using ed25519 keys instead of rsa since this option was added to openssh. No one told me to use tinysshd or ed25519 keys. As someone else pointed out, it seems like most "guides" on ssh, even ones written after ed25519 was added, still advocate rsa keys.

textmode | 7 years ago | on: Facebook’s New Message to WhatsApp: Make Money

In an interview a number of years ago, one of the founders of WhatsApp said the $1 fee was actually not a business model. I believe the interview is on YouTube.

He said he instituted the $1 fee to try to slow down a relentless increase in new users because he was afraid of potential outages.

textmode | 7 years ago | on: The Bullshit Web

"I don't think there is anything wrong with user agents downloading resources (like images and stylesheets) linked to by an html document."

Neither do I. For some websites, this is both necessary and appropriate.

However, in cases where the user does not want/need these resources, or where she does not trust the provider, I do not think there is anything wrong with not downloading images, stylesheets, unnecessary scripts, fonts, spyware, advertisements, etc.

textmode | 7 years ago | on: The Bullshit Web

"I just loaded the New York Times front page. It was 6.6mb."

   ftp -4o 1.htm https://www.nytimes.com

   du -h 1.htm

   206K

For the author, 206K somehow grew to 6.6M.

Could it have anything to do with the browser he is using?

Does it automatically load resources specified by someone other than the user, without any user input?

Above I specified www.nytimes.com. I did not specify any other sources. I got what I wanted: text/html. It came from the domain I specified. (I can use a client that does not do redirects.)

But what if I used a popular web browser to download the front page?

What would I get then? Maybe I would get more than just text, more than 206K and perhaps more from sources I did not specify.

If the user wants application/json instead of text/html, NYTimes has feeds for each section:

    curl  https://static01.nyt.com/services/json/sectionfronts/$1/index.jsonp

where $1 is the section name, e.g., "world".

The user can use the json to create the html page she wants, client-side. Or she can let a browser javascript engine use it to construct the page that someone else wants, probably constructing it in a way that benefits advertisers.

textmode | 7 years ago | on: The Bullshit Web

Salesforce uses a large quantity of DNS indirection, more than even the large CDNs. I measure the amount of lookups that sites and apps require, and the delay it causes. Most sites on the www only require two lookups.

This is perhaps an example of "... the benefits primarily accrue to the developers of the app and not to the customer."

textmode | 7 years ago | on: How F5Bot Slurps All of Reddit

I apologise if I confused you. I was simply wondering why he is not using pipelining, which IME can be ideal for the sort of text retrieval he is performing.

textmode | 7 years ago | on: How F5Bot Slurps All of Reddit

You might be right.

With the "&limit" parameter he can change how many items he receives per HTTP request. This has nothing to do with a limit on how many HTTP requests he can make per TCP connection (pipelining). Maybe that is the "100" he is complaining about, i.e., 100 items per HTTP request.

However you failed to answer my question: Is he making 100 TCP connections to make 100 HTTP requests?

Does the Reddit server set a limit on how many HTTP requests he can make per connection? (100 is a common limit for web servers)

Sometimes the server admins may set a limit of 1 HTTP request per TCP connection. This prevents users from pipelining outside the browser, e.g., with libcurl or some other method.

textmode | 7 years ago | on: How F5Bot Slurps All of Reddit

"Turns out that Reddit [API] has a limit. It'll only show you 100 posts at a time."

100 sounds like a typical "max-requests" pipelining limit.

He does not mention CURLMOPT_PIPELINING.

Does this mean he makes 100 TCP connections in order to make 100 HTTP requests?