bdash | 29 days ago
The Apple Silicon CPU Optimization Guide has a lot of great information on SME and SSVE, along with more general information on optimizing for Apple's CPUs.
A few quotes from Apple's guide that are particularly relevant to SSVE, from "SSVE Vector Execution Unit Optimization":
> Broadly, this unit is designed to support long vector and matrix operations performed on ZA storage _in the SME Processing Grid_.
> Recommendation: Use SSVE in a supporting role to enable high throughput SME grid computation.
> [Magnitude: High | Applicability: High] SSVE offers wide 64B vectors. While the ISA includes instructions that can operate on multi-vectors, the throughput is often only one 64B vector per cycle. Use SSVE to enable SME, which offers higher parallelism.
> Because of non-speculative execution, communication latencies, and in some cases long memory and computation latencies, SME engine instructions trail execution in the core by dozens to thousands of cycles. Any core compute instructions that consume data produced by the SME engine may have to wait an indeterminate (but long) amount of time for the data to arrive.
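To make the "supporting role" recommendation concrete, here is a minimal scalar model in Python of how a matrix multiply can be expressed as a sum of outer products accumulated into a tile. This is only a sketch of the general technique, not actual SME or SSVE code and not taken from Apple's guide: the names `outer_product_accumulate`, `matmul_by_outer_products`, and `za` are made up for illustration, and the tile here is an ordinary list of lists standing in for ZA storage. The vector side only prepares one column and one row per step; the quadratic amount of work happens in the accumulate step.

```python
def outer_product_accumulate(za, col, row):
    """Rank-1 update: za[i][j] += col[i] * row[j] for every (i, j).

    Hypothetical stand-in for an outer-product-accumulate step on a tile.
    """
    for i, c in enumerate(col):
        for j, r in enumerate(row):
            za[i][j] += c * r


def matmul_by_outer_products(a, b):
    """Compute a @ b as a sum of k outer products (one per inner index)."""
    m, k, n = len(a), len(b), len(b[0])
    za = [[0.0] * n for _ in range(m)]  # accumulator tile, starts at zero
    for p in range(k):
        # Assumption for illustration: the vector unit's job is only to
        # supply these two operands.
        col = [a[i][p] for i in range(m)]  # p-th column of a
        row = b[p]                         # p-th row of b
        outer_product_accumulate(za, col, row)
    return za


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
result = matmul_by_outer_products(a, b)
```

Each step reads m + n values and performs m * n multiply-adds, which is one way to see why the guide steers the heavy computation toward the grid and treats the vector unit as a feeder.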
anematode|29 days ago
If SSVE is slow, I was hoping that SME instructions could be used in a vector-like fashion (e.g. adding two matrices with high throughput, or computing a Hadamard/element-wise product), but it seems most matrix accelerator ISAs don't offer that.
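For reference, the element-wise operations mentioned here look like this in a plain Python sketch. The function names are hypothetical and nothing below models a real instruction. The last function shows one mathematical reason the mismatch exists: the element-wise product of two length-n vectors is the diagonal of their outer product, so getting it from an outer-product primitive would compute n * n products to keep only n of them.

```python
def matrix_add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def hadamard_row_via_outer_product(u, v):
    """Element-wise product of two vectors, taken from an outer product.

    Builds the full n x n outer product and keeps only its diagonal,
    which is correct but does n * n multiplies for n useful results.
    """
    outer = [[x * y for y in v] for x in u]
    return [outer[i][i] for i in range(len(u))]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
summed = matrix_add(a, b)
product = hadamard(a, b)
diag = hadamard_row_via_outer_product([1, 2, 3], [4, 5, 6])
```

Whether a given matrix ISA has a cheaper route for these operations is outside what this sketch can show; it only illustrates why an outer-product engine is not a natural fit for them.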
bdash|29 days ago