victotronics | 2 years ago

You do a lot of scare quotes. Do you have any suggestions on how things could be different? You need batch jobs because the scheduler has to wait for resources to be available. It's kinda like Tetris in processor/time space. (In fact, that's my personal "proof" that workload scheduling is NP-complete: it's isomorphic to Tetris.)
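To make the Tetris picture concrete, here is a toy sketch of greedy first-fit placement in processor/time space. It is only an illustration, not how Slurm or any real workload manager decides things; the `Job` class, the `schedule` function, the discrete time slots, and the fixed `horizon` are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int      # processors the job needs
    duration: int   # number of time slots the job runs


def schedule(jobs, total_cores, horizon=1000):
    """Greedy first-fit: give each job, in queue order, the earliest start
    slot at which enough cores are free for its whole duration."""
    free = [total_cores] * horizon  # free cores in each time slot
    starts = {}
    for job in jobs:
        if job.cores > total_cores:
            raise ValueError(f"{job.name} needs more cores than the cluster has")
        for start in range(horizon - job.duration + 1):
            window = free[start:start + job.duration]
            if all(c >= job.cores for c in window):
                # Reserve the cores for the whole duration of the job.
                for t in range(start, start + job.duration):
                    free[t] -= job.cores
                starts[job.name] = start
                break
        else:
            raise ValueError(f"{job.name} does not fit within the horizon")
    return starts


# Hypothetical 4-core cluster with three queued jobs.
queue = [Job("A", cores=3, duration=2),
         Job("B", cores=2, duration=3),
         Job("C", cores=1, duration=1)]
plan = schedule(queue, total_cores=4)
print(plan)
```

In this toy run, B has to wait until A finishes because only one core is free, while the small job C slips into the gap next to A. That waiting is why jobs get submitted as batch work rather than started immediately.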

And what's wrong with shell scripts? They're a lingua franca, generally accepted across scientific disciplines, cluster vendors, workload managers, and so on. Considering the complexity of some setups (copy data to node-local file systems, run multiple programs, post-process results, ...), I don't see how you could set things up other than in some scripting language. And then Unix shell scripts are not the worst idea.

Debugging failures: yeah. Too many levels where something can go wrong, and it can be a pain to debug. Still, your average cluster processes a few million jobs in its lifetime. If more than a microscopic portion of those failed, computing centers would need way more personnel than they have.

crabbone | 2 years ago

> And what's wrong with shell scripts?

When used as configuration? Here are some things that are wrong:

* Each piece of configuration is forced onto a single line, which makes long values inconvenient (for example, if you want Slurm with Pyxis and need to specify the image name, it will most likely not fit on the screen).

* Oh, and since we're mentioning Pyxis -- their image names have a pound sign in them, and now you also need to figure out how to escape it, because for some reason, if used literally, it breaks the comment parser.

* No syntax highlighting (because it's all comments).

* No way to create more complex configuration, i.e. no way to have any types other than strings, no way to have variables, no way to have collections of things.

* No way to reuse configuration (you have to copy it from one job file to another). I honestly don't even know what happens if you try to source a job configuration file from another job configuration.
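One workaround for the points above is to keep the configuration in a real data structure and generate the `#SBATCH` comment lines from it. The following is a minimal sketch of that idea, not an established tool: `render_sbatch`, the `base` and `job` dictionaries, and the image name are hypothetical, and the sketch makes no attempt to solve the pound-sign escaping problem. The option names used (`--nodes`, `--time`, `--exclusive`, `--nodelist`, `--job-name`) are regular `sbatch` options, while `--container-image` is assumed to be provided by Pyxis.

```python
def render_sbatch(options):
    """Render a dict of options as the header of a Slurm batch script."""
    lines = ["#!/bin/bash"]
    for key, value in options.items():
        if isinstance(value, bool):
            # Boolean flags take no value; False simply omits the flag.
            if value:
                lines.append(f"#SBATCH --{key}")
        elif isinstance(value, (list, tuple)):
            # Collections become comma-separated values.
            lines.append(f"#SBATCH --{key}={','.join(map(str, value))}")
        else:
            lines.append(f"#SBATCH --{key}={value}")
    return "\n".join(lines) + "\n"


# Shared settings, reused across jobs instead of being copy-pasted.
base = {"nodes": 2, "time": "01:00:00", "exclusive": True}

# Hypothetical job that extends the shared settings.
job = {
    **base,
    "job-name": "train",
    "nodelist": ["node01", "node02"],
    # Hypothetical image name, shown unescaped.
    "container-image": "registry.example.com#team/image:tag",
}

header = render_sbatch(job)
print(header)
```

This gets you types, variables, collections and reuse on the generating side, but the output is still one long comment line per option, so the readability complaint stands.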

All in all, it's really hard to imagine a worse configuration format. It looks like the product of some sort of code-golfing competition where the goal was to make it as bad as possible while still retaining some shreds of functionality.