dvdkhlng | 4 years ago

I don't know anything about the details here, but with usual linear algebra stuff, bandwidth depends on the size of the kernel that fits into the local memory inside whatever IC you use for your floating-point computation.

E.g. matrix multiplication of n×n square matrices has computational cost of n³ but bandwidth cost of n². Usually a big m×m matrix is split into many blocks of n×n matrices (with m = k×n). If an n×n matrix fits into the local store of your CPU (cache or registers), then the bandwidth cost for the m×m matrix product is k³×n² = m×m×m/n, so the bigger the block size 'n' that you can process inside the CPU, the less bandwidth you need.
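A minimal sketch of the counting argument above (the function name and the way loads are tallied are my own; each n×n block read of A or B is charged n² element loads, so total traffic comes out to 2·k³·n² = 2·m³/n):

```python
import numpy as np

def blocked_matmul(A, B, n):
    """Multiply m×m matrices A and B using n×n blocks (m divisible by n).
    Returns the product and the number of elements loaded from 'memory',
    charging n*n elements per block of A or B that is brought in."""
    m = A.shape[0]
    k = m // n
    C = np.zeros((m, m))
    loads = 0
    for i in range(k):
        for j in range(k):
            for p in range(k):
                # load one n×n block of A and one of B: 2*n*n elements
                a = A[i*n:(i+1)*n, p*n:(p+1)*n]
                b = B[p*n:(p+1)*n, j*n:(j+1)*n]
                loads += 2 * n * n
                C[i*n:(i+1)*n, j*n:(j+1)*n] += a @ b
    return C, loads
```

Doubling n halves the counted traffic: with m = 64, n = 8 gives 2·64³/8 = 65536 element loads, while n = 16 gives half that, even though the arithmetic (m³ multiply-adds) is unchanged.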

edit: formatting
