(no title)
Artgor | 2 years ago
It isn't that hard, I was able to run it on an M1. The changes are:
remove or modify the multiprocessing code: on macOS it doesn't behave the same way as written (macOS defaults to the `spawn` start method rather than `fork`);
replace `device = "cuda"` with `device = "mps"`
in the line `att_idxs = (torch.clamp(torch.arange(context_size)[None, :] - torch.arange(context_size)[:, None], -pos_emb_radius, pos_emb_radius-1) % pos_emb_size).to("cuda")`, replace `"cuda"` with `"mps"`;
in `optim.AdamW`, remove `fused=True`: the fused optimizer implementation requires CUDA;
replace
```
with autocast(device_type='cuda', dtype=torch.float16):
    _, loss = mlm_head(bert(batch_data_torch_xs[mb_start_idx:mb_end_idx]), batch_data_torch_ys[mb_start_idx:mb_end_idx])
```
with simply
```
_, loss = mlm_head(bert(batch_data_torch_xs[mb_start_idx:mb_end_idx]), batch_data_torch_ys[mb_start_idx:mb_end_idx])
```
replace `scaler.scale(corrected_loss).backward()` with `corrected_loss.backward()`
replace
```
scaler.unscale_(optimizer)
scaler.step(optimizer)
scaler.update()
```
with `optimizer.step()`
It should work.
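For reference, here is a minimal sketch of what the ported setup looks like. The device-selection helper, the tiny `Linear` stand-in model, and the small hyperparameter values are illustrative assumptions, not the original code; only the `att_idxs` expression is taken from it:

```python
import torch

# Pick the best available device: CUDA on NVIDIA, MPS on Apple Silicon, else CPU.
def pick_device():
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()

# The relative-position index tensor from the original code, built on the
# chosen device instead of a hard-coded "cuda" (small illustrative values
# for context_size, pos_emb_radius, pos_emb_size).
context_size, pos_emb_radius, pos_emb_size = 8, 4, 8
att_idxs = (torch.clamp(
    torch.arange(context_size)[None, :] - torch.arange(context_size)[:, None],
    -pos_emb_radius, pos_emb_radius - 1) % pos_emb_size).to(device)

# Minimal training step without the CUDA-only features: no fused AdamW,
# no autocast/GradScaler -- plain fp32 forward/backward/step.
model = torch.nn.Linear(4, 2).to(device)          # stand-in for bert + mlm_head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # fused=True removed

x = torch.randn(16, 4, device=device)
y = torch.randint(0, 2, (16,), device=device)
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()      # was: scaler.scale(corrected_loss).backward()
optimizer.step()     # was: scaler.unscale_ / scaler.step / scaler.update
optimizer.zero_grad()
```

The same script then runs unmodified on a CUDA machine, since the device is chosen at runtime rather than hard-coded.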