

jorlow | 1 year ago

Note that Llama's feed-forward block is a bit different too:

  self.w2(F.silu(self.w1(x)) * self.w3(x))

I.e. the nonlinearity acts as a gate: silu(w1(x)) multiplies w3(x) elementwise before the w2 projection.

https://github.com/meta-llama/llama3/blob/14aab0428d3ec3a959...


soraki_soladead|1 year ago

Fwiw, that's SwiGLU in #3 above. Swi = Swish = SiLU. GLU is gated linear unit, i.e. the gate construction you describe.
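
To make the naming concrete, here is a rough, dependency-free sketch of that gated feed-forward in plain Python. It mirrors the w2(silu(w1(x)) * w3(x)) expression quoted upthread; the helper names (`silu`, `linear`, `swiglu_ffn`) and the toy weights are made up for illustration and are not taken from the Llama code, which uses PyTorch layers.

```python
import math

def silu(z):
    """Swish/SiLU: z * sigmoid(z). Fine for toy inputs; not numerically hardened."""
    return z / (1.0 + math.exp(-z))

def linear(weight, x):
    """Bias-free linear layer: weight is a list of rows, x is a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def swiglu_ffn(x, w1, w2, w3):
    """Compute w2(silu(w1(x)) * w3(x)) for a single input vector."""
    gate = [silu(h) for h in linear(w1, x)]        # nonlinear gate branch (the "Swi")
    value = linear(w3, x)                          # plain linear branch
    hidden = [g * v for g, v in zip(gate, value)]  # elementwise gating (the "GLU")
    return linear(w2, hidden)                      # project back to the model dim

# Hypothetical toy weights: model dim 2, hidden dim 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w3 = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

y = swiglu_ffn([1.0, 2.0], w1, w2, w3)
```

The only difference from a plain two-layer MLP is the extra w3 branch: the activation is not applied to the values that get passed on, it scales them.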