They do not try to blame it on complex systems or other factors.
Users lost a day and a half of recent work (which doesn't seem to be that bad).
Regarding the loss of files on the Lustre file system of your supercomputer system, we are 100% responsible.
We deeply apologize for the great deal of inconvenience caused by this serious file-loss failure.
We would like to report the background of the file disappearance, its root cause and future countermeasures as follows:
We believe that this file loss is 100% our responsibility.
We will offer compensation for users who have lost files.
[...]
Impact:
--
Target file system: /LARGE0
Deletion period: from 17:32 on December 14, 2021 to 12:43 on December 16, 2021
Files subject to deletion: files that had not been updated since 17:32 on December 3, 2021
[...]
Cause:
--
The backup script uses the find command to delete log files that are older than 10 days.
The directory to delete from is passed to find's delete operation via a variable.
A new, improved version of the script was applied to the system.
However, during deployment there was a lack of consideration, in that the periodically-run script was not disabled first.
The already-running instance therefore reloaded the modified shell script from the middle of its execution.
As a result, the find command was executed with its variables undefined and deleted the files.
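A minimal sketch of this failure mode (the variable and file names are invented; the report does not give the actual script). It uses `-print` instead of a delete action so the demo is harmless:

```shell
# BACKUP_LOG_DIR and the file names are hypothetical, for illustration only.
unset BACKUP_LOG_DIR                      # simulate the never-assigned variable
target=$(mktemp -d)                       # safe stand-in for /LARGE0
touch "$target/fresh_results.txt"
touch -d '20 days ago' "$target/old_data.txt"

# Intended: clean only "$target/$BACKUP_LOG_DIR" (a log subdirectory).
# With the variable undefined, the path collapses to the whole tree, so
# every file older than 10 days matches -- not just log files.
matched=$(find "$target/${BACKUP_LOG_DIR}" -type f -mtime +10 -print)
echo "$matched"
```

With the variable set, only the log subdirectory would have been searched; undefined, the scope silently widens to everything old on the file system.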
[...]
Further measures:
--
In the future, programs to be applied to the system will be fully verified before being applied.
We will examine the extent of the impact and make improvements so that similar problems do not occur.
In addition, we will re-educate the engineers in charge about human error and about risk prediction and prevention, to prevent recurrence.
We will thoroughly implement these measures.
Japanese companies structure apologies very differently from US ones, because the legal consequences are very different. In the US, an apology is considered an admission of responsibility and is often the starting point of legal action against the culprit, while in Japan, a sufficiently sincere* apology may well defuse the situation entirely.
* 真 makoto, a word often glossed as "sincere" but not identical in meaning: it's about the amount of effort you're willing to take on, not how "honestly" you feel something
Also, the culprit here is not HP proper but their consulting/SI wing HP Enterprise, which has a, uhh, less than stellar reputation for competence.
So this is something I’ve never understood. If you modify a shell script while it’s running, the shell executes the modified file. This normally but not always causes the script to fail.
Now I’ve known about this behaviour for a very long time and it always seemed very broken to me. It’s not how binaries work (at least not when I was doing that kind of thing).
So I guess bash or whatever does an mmap of the script it’s running, which is presumably why modifications to the script are visible immediately. But if a new file was installed eg using cp/tar/unzip, I’m surprised that this didn’t just unlink the old script and create a new one - which would create a new inode and therefore make the operation atomic, right? And this (I assume) is why a recompiled binary doesn’t have the same problem (because the old binary is first unlinked).
So, how could this (IMO) bad behaviour be fixed? Presumably mmap is used for efficiency, but isn’t it possible to mark a file as in use so it can’t be modified? I’ve certainly seen on some old Unices that you can’t overwrite a running binary. Why can’t we do the same with shell scripts?
Honestly, while it’s great that HP is accepting responsibility, and we know that this happens, the behaviour seems both arbitrary and unnecessary to me. Is it fixable?
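For what it's worth, the inode distinction described above can be observed directly. A sketch (timings and file names are arbitrary): overwriting the script in place reuses the inode, so the running bash picks up the new bytes, while mv-ing a new file over it installs a new inode and leaves the running instance untouched:

```shell
dir=$(mktemp -d)
cat > "$dir/job.sh" <<'EOF'
echo begin
sleep 2
echo OLD-CODE
EOF

# 1) In-place overwrite: same file, same inode.
bash "$dir/job.sh" > "$dir/inplace.log" &
sleep 1
printf 'echo begin\nsleep 2\necho NEW-CODE\n' > "$dir/job.sh"
wait
# The running bash resumed reading at its saved offset and got the new line.

# 2) Replace via mv: new file, new inode, atomic rename(2) over the old name.
bash "$dir/job.sh" > "$dir/replaced.log" &
sleep 1
printf 'echo begin\nsleep 2\necho OTHER-CODE\n' > "$dir/job.sh.new"
mv "$dir/job.sh.new" "$dir/job.sh"
wait
# The running bash still holds the old inode and finishes the code it started.

cat "$dir/inplace.log" "$dir/replaced.log"
```

The first run ends by printing NEW-CODE; the second still prints the version it started with. This is why deploying scripts with `mv` from the same file system (rename, not copy) sidesteps the problem.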
Ahhh the joy of lustre and the accidental cronjob.
About 15 years ago I experienced the same thing. An updater script based on rsync was trying to keep one NFS machine image in sync with another. For whatever reason, the script accidentally started syncing the entire NFS root directory with its own, deleting everything show by show in reverse alphabetical order.
At the time Lustre didn't really have any good monitoring tools for showing you who was doing what, so they had to wait till they hit a normal NFS server before they could figure out and stop what was deleting everything.
Needless to say, a lot of the backups may have been failing.
Huh. I may be remembering incorrectly, but I recall somebody somewhat entrenched in the related business telling me, roughly two years ago, that HP has been going downhill from an industry perspective…
Nice to see them completely own up to the mistake right away. I wonder who made the final call on doing so, companies admitting fault so transparently & immediately offering recourse seems pretty damn rare anymore.
Without the intent of sounding xenophobic, I wonder if it’s because it’s HP Japan where reputation is much more culturally important. US MBA’s admitting fault… haha…
Just pointing out that those are most likely just the days the files were saved. There could still be some unlucky souls who ran computations for several days/weeks that happened to terminate on those days (and store the results). Those people could lose significantly more than a day and a half. On the flip side, HPC jobs tend to be frequently checkpointed unless the storage cost is prohibitive for the type of job.
> However, during deployment, there was a lack of consideration as the cronjob was not disabled.
I'm intrigued to see that the report you link (which is in Japanese) mentions `find` and `bash` by those names, but doesn't contain the word `cron`. How does the report refer to the idea of a "cronjob"? Why is it different?
The style of apology is very nice. It is not as extensive as some technical post-mortem analyses that I've read, but all of the important things are here.
https://gist.github.com/robin-a-meade/58d60124b88b60816e8349... [^1]
And always, always, use ShellCheck (https://www.shellcheck.net/) to catch most pitfalls and common mistakes in this powerful but dangerous language that is shell scripting.
[^1]: I think this gist is better than the original article it is based on, because the article also suggested changing the IFS variable, which is not good advice, so sadly the original text becomes a bad recommendation!
Everyone is mentioning error control for shell scripts or "don't use shell scripts", but neither of those are the solution to this problem. The solution to this problem is correctly implementing atomic deployment, which is important for any system using any programming language.
What I like to do is have two directories I ping pong between when deploying, and a `cur` symlink that points to the current version. The symlink is atomically replaced (new symlink and rename it over) whenever the deploy process completes. Any software/scripts using that tree will be written to first chdir() in, which will resolve the symlink at that time, and thus won't be affected by the deploy (at least as long as you don't do it twice in a row; if that is a concern due to long running processes, you could use timestamped directories instead and a garbage collection process that cleans stuff up once it is certain there are no users left).
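A minimal sketch of that scheme, with made-up directory names (`mv -T` is the GNU coreutils way of saying "rename over the destination itself, don't descend into it"):

```shell
root=$(mktemp -d)
mkdir "$root/slot_a" "$root/slot_b"       # the two ping-pong directories
ln -s slot_a "$root/cur"                  # cur -> currently deployed version

# Deploy slot_b: build the new link under a temporary name, then rename(2)
# it over cur. Readers see either the old target or the new one, never a
# missing or half-written link.
ln -s slot_b "$root/cur.tmp"
mv -T "$root/cur.tmp" "$root/cur"

readlink "$root/cur"                      # now resolves to slot_b
```

The atomicity comes from rename(2), which POSIX guarantees will atomically replace an existing destination on the same file system.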
> the find command containing undefined variables was executed and deleted the files

Just a note that `set -u` at the beginning of a bash script will cause it to throw an error for undefined variables. Of course this should be tested, as it will also cause `[[ $var ]]` to fail. If that's the case,

[ -z "${VAR:-}" ] && echo "VAR is not set or is empty" || echo "VAR is set to $VAR"

will help test that condition.
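A quick check of both behaviors (the script path and the variable name are made up):

```shell
# Demo script: references $backup_dir, which is deliberately never set.
cat > /tmp/nounset_demo.sh <<'EOF'
set -u
echo "would run: find ${backup_dir}/logs -mtime +10 -delete"
EOF

# Under set -u the expansion aborts the script before find would ever run.
bash /tmp/nounset_demo.sh || echo "aborted: backup_dir is unset"

# The ${VAR:-default} form is exempt from set -u, so it probes safely:
bash -uc '[ -z "${backup_dir:-}" ] && echo "backup_dir is not set or is empty"'
```

Had the incident's script begun with `set -u`, the find command would never have run with an empty path.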
I've been a Linux coder and user forever, and I didn't know that bash "reloads" a script while running if the file is modified. Good to learn before I also delete a whole filesystem due to this! :)
> However, during deployment, there was a lack of consideration as the periodical script was not disabled.
> The modified shell script was reloaded from the middle.
In my opinion, this is the wrong takeaway, and an important lesson was not learned.
It's not an operator "lack of consideration".
The lesson should be "when dealing with important data, do not use outrageously bad programming languages that allow run-time code rewriting, and that continue to execute even in the presence of undefined variables".
If you use shell scripting, this is bound to happen, and will happen again.
"We'll use Python or anything else instead of shell" would fundamentally remove the possibility of this category of failure.
> outrageously bad programming languages that allow run-time code rewriting
Almost all languages allow run-time code rewriting. Some of them just make it easier than others, and some of them make it a very useful feature. If you're very careful, updating a bash script while you're running it can be useful, but most often it's a mistake; in Erlang, hot loading is usually intentional and often useful. Most other languages don't make it easy, so you'll probably only do it if it's useful.
The problem was not that they used shell scripts. The problem was that the people writing the shell scripts were just bad programmers. If you hire a bad programmer to write them in Python, they'll still have tons of bugs.
The shell scripts I write have fewer bugs than the Python code I see other teams churn out. But that's because I know what I'm doing. Don't hire people who don't know what they're doing.
I have switched to F# for scripting tasks and have found F# scripts are (usually) either correct on the first try or fail at the type-checking stage. I would highly recommend it for anything near production.
During a functional modification of the backup program by Hewlett-Packard Japan, the supplier of the supercomputer system, a problem with an unintended modification of the program and with its application procedure caused a malfunction: instead of deleting backup log files that were no longer needed, the process deleted the files under the /LARGE0 directory.
Translated with www.DeepL.com/Translator (free version)
The cause of this is a known behavior of Unix/Linux shells, but unfortunately not everyone knows it. If you change a script while it is running, the shell reads what it thinks is the next line at the byte position it had reached in the old script, but out of the new script file. So what it reads and executes will probably not be what you wanted.
Assuming this was a "scratch" HPC filesystem, as I'd guess, "scratch" is used advisedly -- users should be prepared to lose anything on it, not that it should happen with finger trouble. However, if I understand correctly from the comments, I'm surprised at the tools, and that the vendor was managing the filesystem. I'd expect to use https://github.com/cea-hpc/robinhood/wiki with Lustre, though I thought I'd seen a Cray presentation about tools of their own.
This is an incredible edge case. I'm amazed they hit this issue and just as amazed that they correctly identified that issue and reported on it.
This response is great, it's the exact opposite of the wishy-washy mealy-mouthed response to the lastpass security incident.
And this is why the shell should not execute commands containing "undefined" variables, and should give an error instead.
Not really; the report doesn't mention any error in the script.
Is it maybe that they were editing or copying the file and a cron job kicked off?