rejuvyesh | 3 years ago
```
Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Initial Train Loss: 5.7188
Initial Test Loss: 5.2812
Took: 11.41 seconds
Train Loss: 0.0354
Test Loss: 0.0514
Took: 11.58 seconds
Train Loss: 0.0327
Test Loss: 0.0511
Took: 11.42 seconds
Train Loss: 0.0316
Test Loss: 0.0505
```
I think almost all of the time here is Python overhead, because if we increase the batch size 10x, it still takes about the same time:

```
Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Initial Train Loss: 5.5391
Initial Test Loss: 5.5938
Took: 11.03 seconds
Train Loss: 0.0444
Test Loss: 0.0545
Took: 11.16 seconds
Train Loss: 0.0388
Test Loss: 0.0496
Took: 11.01 seconds
Train Loss: 0.0384
Test Loss: 0.0490
```
See [gist](https://gist.github.com/rejuvyesh/6c428ea12154edbb36cd4359fa...) for the implementation.