JanSchu | 10 months ago
1. You can, in fact, get rid of every FP instruction on M0. The trick is to pre‑bake the scale and zero_point into a single fixed‑point multiplier per layer (the dyadic form you mentioned). The formula is
    y = ((Wx + b) * M) >> s

where M fits in an int32 and s is the power-of-two shift. You compute M and s once on the host, write them out as const tables, and your inner loop is literally a MAC followed by a multiply and a shift. No soft-float library, no division.
2. CMSIS‑NN already gives you the fast int8 kernels. The docs are painful but you can steal just four files: arm_fully_connected_q7.c, arm_nnsupportfunctions.c, and their headers. On M0 this compiled to ~3 kB for me. Feed those kernels fixed‑point activations and you only pay for the ops you call.
3. Workflow that kept me sane:
Prototype in PyTorch. Tiny net, ReLU, MSE, Adam, done.
torch.quantization.quantize_qat for quantization‑aware training. Export to ONNX, then run a one‑page Python script that dumps .h files with weight, bias, M, s.
Hand‑roll the inference loop in C. It is about 40 lines per layer, easy to unit‑test on the host with the same vectors you trained on.
By starting with a known‑good fp32 model you always have a checksum: the int8 path must match fp32 within tolerance or you know exactly where to look.
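To make the "40 lines per layer" concrete, here is a sketch of what that hand-rolled loop might look like for one fully connected layer with ReLU, using the dyadic (M, s) requantization from point 1. The names, the row-major weight layout, and the int32-bias convention are illustrative assumptions, not the poster's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* One int8 fully connected layer + ReLU.
 * x:    int8 input activations, length in_dim
 * w:    int8 weights, row-major [out_dim][in_dim]
 * bias: int32 biases, already scaled into the accumulator domain
 * M, s: per-layer dyadic requantization constants computed on the host */
static void fc_int8_relu(const int8_t *x, const int8_t *w,
                         const int32_t *bias, int8_t *y,
                         size_t in_dim, size_t out_dim,
                         int32_t M, int s)
{
    for (size_t o = 0; o < out_dim; ++o) {
        int32_t acc = bias[o];
        const int8_t *row = w + o * in_dim;
        for (size_t i = 0; i < in_dim; ++i)
            acc += (int32_t)row[i] * x[i];        /* the MAC */

        /* dyadic requantization: (acc * M) >> s, round-to-nearest */
        int64_t prod = (int64_t)acc * M;
        int32_t out = (int32_t)((prod + ((int64_t)1 << (s - 1))) >> s);

        /* ReLU + saturate back to int8 */
        if (out < 0)   out = 0;
        if (out > 127) out = 127;
        y[o] = (int8_t)out;
    }
}
```

A layer in this shape is easy to unit-test on the host against the fp32 reference, exactly as described above: feed both paths the same vectors and check they agree within tolerance.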
lynaghk | 10 months ago
Re: computing M and s, does torch.quantization.quantize_qat do this or do you do it yourself from the (presumably f32) activation scaling that torch finds?
I don't have much experience with this kind of numerical computing, so I have no intuition for how much the "quantization" of selecting M and s might impact the overall performance of the network. I.e., whether:
- M and s should be trained as part of QAT (e.g., the "Learned Step Size Quantization" paper)
- it's fine to just deterministically compute M and s from the f32 activation scaling.
Also: Thanks for the tips re: CMSIS-NN, glad to know it's possible to use in a non-framework way. Any chance your example is open source somewhere?