They do not try to blame it on complex systems or other factors.
Users lost a day and a half of recent work (which doesn't seem to be that bad).
Regarding the loss of files on the Lustre file system of your supercomputer system, we are 100% responsible.
We deeply apologize for the great deal of inconvenience caused by this serious file-loss failure.
We would like to report the background of the file disappearance, its root cause and future countermeasures as follows:
We believe that this file loss is 100% our responsibility.
We will offer compensation for users who have lost files.
[...]
Impact:
--
Target file system: /LARGE0
Deletion period: from 17:32 on December 14, 2021 to 12:43 on December 16, 2021
Files subject to deletion: files that had not been updated since 17:32 on December 3, 2021
[...]
Cause:
--
The backup script uses the find command to delete log files that are older than 10 days.
The directory to delete from is passed to find's delete operation via a variable.
A new, improved version of the script was applied to the system.
However, during deployment there was a lack of consideration, in that the periodically-run script was not disabled first.
The already-running instance therefore reloaded the modified shell script from the middle of its execution.
As a result, the find command was executed with its variables undefined and deleted the files.
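A minimal sketch of this failure mode (the variable and file names are invented; the report does not give the actual script). It uses `-print` instead of a delete action so the demo is harmless:

```shell
# BACKUP_LOG_DIR and the file names are hypothetical, for illustration only.
unset BACKUP_LOG_DIR                      # simulate the never-assigned variable
target=$(mktemp -d)                       # safe stand-in for /LARGE0
touch "$target/fresh_results.txt"
touch -d '20 days ago' "$target/old_data.txt"

# Intended: clean only "$target/$BACKUP_LOG_DIR" (a log subdirectory).
# With the variable undefined, the path collapses to the whole tree, so
# every file older than 10 days matches -- not just log files.
matched=$(find "$target/${BACKUP_LOG_DIR}" -type f -mtime +10 -print)
echo "$matched"
```

With the variable set, only the log subdirectory would have been searched; undefined, the scope silently widens to everything old on the file system.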
[...]
Further measures:
--
In the future, programs to be applied to the system will be fully verified before being applied.
We will examine the extent of the impact and make improvements so that similar problems do not occur.
In addition, we will re-educate the engineers in charge about human error and about risk prediction and prevention, to prevent recurrence.
We will thoroughly implement these measures.
Japanese companies structure apologies very differently from US ones, because the legal consequences are very different. In the US, an apology is considered an admission of responsibility and is often the starting point of legal action against the culprit, while in Japan, a sufficiently sincere* apology may well defuse the situation entirely.
* 真 makoto, a word often glossed as "sincere" but not identical in meaning: it's about the amount of effort you're willing to take on, not how "honestly" you feel something
Also, the culprit here is not HP proper but their consulting/SI wing HP Enterprise, which has a, uhh, less than stellar reputation for competence.
So this is something I’ve never understood. If you modify a shell script while it’s running, the shell executes the modified file. This normally but not always causes the script to fail.
Now I’ve known about this behaviour for a very long time and it always seemed very broken to me. It’s not how binaries work (at least not when I was doing that kind of thing).
So I guess bash or whatever does an mmap of the script it’s running, which is presumably why modifications to the script are visible immediately. But if a new file was installed eg using cp/tar/unzip, I’m surprised that this didn’t just unlink the old script and create a new one - which would create a new inode and therefore make the operation atomic, right? And this (I assume) is why a recompiled binary doesn’t have the same problem (because the old binary is first unlinked).
So, how could this (IMO) bad behaviour be fixed? Presumably mmap is used for efficiency, but isn’t it possible to mark a file as in use so it can’t be modified? I’ve certainly seen on some old Unices that you can’t overwrite a running binary. Why can’t we do the same with shell scripts?
Honestly, while it’s great that HP is accepting responsibility, and we know that this happens, the behaviour seems both arbitrary and unnecessary to me. Is it fixable?
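For what it's worth, the inode distinction described above can be observed directly. A sketch (timings and file names are arbitrary): overwriting the script in place reuses the inode, so the running bash picks up the new bytes, while mv-ing a new file over it installs a new inode and leaves the running instance untouched:

```shell
dir=$(mktemp -d)
cat > "$dir/job.sh" <<'EOF'
echo begin
sleep 2
echo OLD-CODE
EOF

# 1) In-place overwrite: same file, same inode.
bash "$dir/job.sh" > "$dir/inplace.log" &
sleep 1
printf 'echo begin\nsleep 2\necho NEW-CODE\n' > "$dir/job.sh"
wait
# The running bash resumed reading at its saved offset and got the new line.

# 2) Replace via mv: new file, new inode, atomic rename(2) over the old name.
bash "$dir/job.sh" > "$dir/replaced.log" &
sleep 1
printf 'echo begin\nsleep 2\necho OTHER-CODE\n' > "$dir/job.sh.new"
mv "$dir/job.sh.new" "$dir/job.sh"
wait
# The running bash still holds the old inode and finishes the code it started.

cat "$dir/inplace.log" "$dir/replaced.log"
```

The first run ends by printing NEW-CODE; the second still prints the version it started with. This is why deploying scripts with `mv` from the same file system (rename, not copy) sidesteps the problem.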
Ahhh the joy of lustre and the accidental cronjob.
About 15 years ago I experienced the same thing. An updater script based on rsync was trying to keep one NFS machine image in sync with another. For whatever reason, the script accidentally started syncing the entire NFS root directory with its own, deleting everything show by show in reverse alphabetical order.
At the time Lustre didn't really have any good monitoring tools for showing you who was doing what, so they had to wait till they hit a normal NFS server before they could figure out and stop what was deleting everything.
Needless to say, a lot of the backups may have been failing.
Huh. I may be remembering incorrectly, but I recall somebody somewhat entrenched in the related business telling me, roughly two years ago, that HP has been going downhill from an industry perspective…
Nice to see them completely own up to the mistake right away. I wonder who made the final call on doing so, companies admitting fault so transparently & immediately offering recourse seems pretty damn rare anymore.
Without the intent of sounding xenophobic, I wonder if it’s because it’s HP Japan where reputation is much more culturally important. US MBA’s admitting fault… haha…
Just pointing out that those are most likely just the days the files were saved. There could still be some unlucky souls who ran computations for several days/weeks that happened to terminate on those days (and store the results). Those people could lose significantly more than a day and a half. On the flip side, HPC jobs tend to be frequently checkpointed unless the storage cost is prohibitive for the type of job.
> However, during deployment, there was a lack of consideration as the cronjob was not disabled.
I'm intrigued to see that the report you link (which is in Japanese) mentions `find` and `bash` by those names, but doesn't contain the word `cron`. How does the report refer to the idea of a "cronjob"? Why is it different?
The style of apology is very nice. It is not as extensive as some technical post-mortem analyses that I've read, but all of the important things are here.
https://gist.github.com/robin-a-meade/58d60124b88b60816e8349... [^1]
And always, always, use ShellCheck (https://www.shellcheck.net/) to catch most pitfalls and common mistakes in this powerful but dangerous language that is shell scripting.
[^1]: I think this gist is better than the original article it is based on, because the article also suggested changing the IFS variable, which is not good advice, so sadly the original text becomes a bad recommendation!
Everyone is mentioning error control for shell scripts or "don't use shell scripts", but neither of those are the solution to this problem. The solution to this problem is correctly implementing atomic deployment, which is important for any system using any programming language.
What I like to do is have two directories I ping pong between when deploying, and a `cur` symlink that points to the current version. The symlink is atomically replaced (new symlink and rename it over) whenever the deploy process completes. Any software/scripts using that tree will be written to first chdir() in, which will resolve the symlink at that time, and thus won't be affected by the deploy (at least as long as you don't do it twice in a row; if that is a concern due to long running processes, you could use timestamped directories instead and a garbage collection process that cleans stuff up once it is certain there are no users left).
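A minimal sketch of that scheme, with made-up directory names (`mv -T` is the GNU coreutils way of saying "rename over the destination itself, don't descend into it"):

```shell
root=$(mktemp -d)
mkdir "$root/slot_a" "$root/slot_b"       # the two ping-pong directories
ln -s slot_a "$root/cur"                  # cur -> currently deployed version

# Deploy slot_b: build the new link under a temporary name, then rename(2)
# it over cur. Readers see either the old target or the new one, never a
# missing or half-written link.
ln -s slot_b "$root/cur.tmp"
mv -T "$root/cur.tmp" "$root/cur"

readlink "$root/cur"                      # now resolves to slot_b
```

The atomicity comes from rename(2), which POSIX guarantees will atomically replace an existing destination on the same file system.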
> the find command containing undefined variables was executed and deleted the files

Just a note that `set -u` at the beginning of a bash script will cause it to throw an error for undefined variables. Of course this should be tested, as it will also cause `[[ $var ]]` to fail. If that's the case,

[ -z "${VAR:-}" ] && echo "VAR is not set or is empty" || echo "VAR is set to $VAR"

will help test that condition.
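A quick check of both behaviors (the script path and the variable name are made up):

```shell
# Demo script: references $backup_dir, which is deliberately never set.
cat > /tmp/nounset_demo.sh <<'EOF'
set -u
echo "would run: find ${backup_dir}/logs -mtime +10 -delete"
EOF

# Under set -u the expansion aborts the script before find would ever run.
bash /tmp/nounset_demo.sh || echo "aborted: backup_dir is unset"

# The ${VAR:-default} form is exempt from set -u, so it probes safely:
bash -uc '[ -z "${backup_dir:-}" ] && echo "backup_dir is not set or is empty"'
```

Had the incident's script begun with `set -u`, the find command would never have run with an empty path.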
I've been a Linux coder and user forever, and I didn't know that bash "reloads" a script while running if the file is modified. Good to learn before I also delete a whole filesystem due to this! :)
> However, during deployment, there was a lack of consideration as the periodical script was not disabled.
> The modified shell script was reloaded from the middle.
In my opinion, this is the wrong takeaway, and an important lesson was not learned.
It's not an operator "lack of consideration".
The lesson should be "when dealing with important data, do not use outrageously bad programming languages that allow run-time code rewriting, and that continue to execute even in the presence of undefined variables".
If you use shell scripting, this is bound to happen, and will happen again.
"We'll use Python or anything else instead of shell" would fundamentally remove the possibility of this category of failure.
> outrageously bad programming languages that allow run-time code rewriting
Almost all languages allow run-time code rewriting. Some of them just make it easier than others, and some of them make it a very useful feature. If you're very careful, updating a bash script while you're running it can be useful, but most often it's a mistake; in Erlang, hot loading is usually intentional and often useful. Most other languages don't make it easy, so you'll probably only do it if it's useful.
The problem was not that they used shell scripts. The problem was that the people writing the shell scripts were just bad programmers. If you hire a bad programmer to write them in Python, they'll still have tons of bugs.
The shell scripts I write have fewer bugs than the Python code I see other teams churn out. But that's because I know what I'm doing. Don't hire people who don't know what they're doing.
I have switched to F# for scripting tasks and have found F# scripts are (usually) either correct on the first try or fail at the type-checking stage. I would highly recommend it for anything near production.
During a functional modification of the backup program by Hewlett-Packard Japan, the supplier of the supercomputer system, a problem with an unintended modification of the program and with its application procedure caused a malfunction: instead of deleting backup log files that were no longer needed, the process deleted the files under the /LARGE0 directory.
Translated with www.DeepL.com/Translator (free version)
The cause of this is a known behavior of Unix/Linux shells, but unfortunately not everyone knows it. If you change a script while it is running, the shell reads what it thinks is the next line at the byte position it had reached in the old script, but out of the new script file. So what it reads and executes will probably not be what you wanted.
Assuming this was a "scratch" HPC filesystem, as I'd guess, "scratch" is used advisedly -- users should be prepared to lose anything on it, not that it should happen with finger trouble. However, if I understand correctly from the comments, I'm surprised at the tools, and that the vendor was managing the filesystem. I'd expect to use https://github.com/cea-hpc/robinhood/wiki with Lustre, though I thought I'd seen a Cray presentation about tools of their own.
This is an incredible edge case. I'm amazed they hit this issue and just as amazed that they correctly identified that issue and reported on it.
This response is great, it's the exact opposite of the wishy-washy mealy-mouthed response to the lastpass security incident.
And this is why the shell should not execute commands containing "undefined" variables, and should give an error instead.
Not really; the report doesn't mention any error in the script.
Is it maybe that they were editing or copying the file and a cron job kicked off?