Proteins are linear molecules consisting of sequences of (mostly) 20 amino acids. You can see the list of amino acids here: https://en.wikipedia.org/wiki/Amino_acid#Table_of_standard_a.... There is a standard encoding of amino acids using single letters: A for alanine, etc. Earlier versions of ESM (I haven't read the ESM3 paper yet) use one token per amino acid, plus a few control tokens (beginning of sequence, end of sequence, class token, mask, etc.). Those earlier versions were BERT-style models focused on understanding, not GPT-style generative models.
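To make the one-token-per-amino-acid idea concrete, here is a rough sketch of what such a tokenizer might look like. The token names (`<cls>`, `<eos>`, `<mask>`, etc.), their ids, and the helper functions are my own assumptions for illustration, not the actual ESM vocabulary or API.

```python
# Hypothetical vocabulary: token names and ids are illustrative only,
# not the real ESM vocabulary.
CONTROL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

VOCAB = {tok: i for i, tok in enumerate(CONTROL_TOKENS + list(AMINO_ACIDS))}


def tokenize(sequence):
    """Map a protein string to ids: one token per residue, wrapped in control tokens."""
    ids = [VOCAB["<cls>"]]
    # Letters outside the 20 standard codes fall back to <unk>.
    ids += [VOCAB.get(residue, VOCAB["<unk>"]) for residue in sequence.upper()]
    ids.append(VOCAB["<eos>"])
    return ids


def mask_position(ids, position):
    """Return a copy of ids with one position replaced by <mask> (BERT-style input)."""
    masked = list(ids)
    masked[position] = VOCAB["<mask>"]
    return masked


if __name__ == "__main__":
    ids = tokenize("MKTAYIAK")
    print(ids)
    print(mask_position(ids, 3))
```

A BERT-style model would be trained to predict the original residue at the masked position from the rest of the sequence, which is why these models are better suited to understanding than to left-to-right generation.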