top | item 39832018


anotherjesse | 1 year ago

Additionally, there are instructions for training/inference on a Mac: https://github.com/adamkarvonen/nanoGPT

> To sample on Mac, uncomment line 21 in sample.py. To train on Mac, rename train_shakespeare_char_mac.py to train_shakespeare_char.py

The `mac` file changed several things, so I decided to try training with the original config file, changing only the device to `mps` and setting compile to false:

    iter 100: loss 2.0268, time 815.43ms, mfu 3.24%
    iter 200: loss 1.8523, time 818.79ms, mfu 3.24%
    iter 300: loss 1.7799, time 823.05ms, mfu 3.23%
    iter 400: loss 1.6887, time 819.08ms, mfu 3.23%
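The two changes amount to something like the following override, written in the style of nanoGPT's Python config files (a sketch; the rest of the stock `train_shakespeare_char` config is left untouched):

```python
# Hypothetical override in the style of nanoGPT's config files
# (e.g. config/train_shakespeare_char.py); only these two lines
# differ from the stock config:
device = 'mps'     # run on the Apple-silicon GPU via PyTorch's MPS backend
compile = False    # torch.compile isn't supported on the MPS backend
```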
Training is ~4x slower than the speed reported for the original multi-GPU run: https://wandb.ai/adam-karvonen/chess-gpt-batch/runs/zt5htyl6...

Not bad for an M2 Studio that is running lots of other workloads at the same time.


a_wild_dandan | 1 year ago

I wonder if using Apple’s new MLX Python library (for training on unified memory systems) would yield significant gains.

nl | 1 year ago

This (the device='mps' version) already uses the unified memory plus GPU on M-series Macs.
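For context, the `device='mps'` path is just PyTorch's standard Metal backend; a minimal sketch (falling back to CPU when MPS isn't available, so the same script runs anywhere) looks like:

```python
import torch

# Pick the Apple-silicon GPU when PyTorch was built with MPS support,
# otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

x = torch.randn(64, 64, device=device)
y = x @ x  # executes on the GPU (unified memory) when device == "mps"
print(y.shape)  # torch.Size([64, 64])
```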

It's possible MLX has some additional micro-optimizations, but in general, most people who have compared it against hand-written MPS-based training implementations haven't found big speed-ups yet.