notpublic | 1 year ago

please do explain why

minimaxir|1 year ago

tl;dr: unlike most encoder-only models, the base ModernBERT was trained with code in mind (so I'm assuming it was also trained on JSON/YAML objects), and it includes a custom tokenizer to support that. That's why I mentioned that indentation is important: different levels of indentation map to different single tokens.

This is mostly theoretical and does require a deeper dive to confirm.
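
To make the indentation point concrete, here's a toy sketch of what "different levels of indentation have different single tokens" would look like. Everything in it is an assumption for illustration: `TOY_VOCAB` is a made-up vocabulary (not ModernBERT's real one), and `greedy_tokenize` is a simple longest-match tokenizer I'm swapping in, not the actual tokenization algorithm. Confirming the real behavior would mean inspecting the actual tokenizer's output.

```python
# Toy illustration only: this vocabulary is hypothetical and is NOT
# ModernBERT's real vocabulary. It just contains separate entries for
# 2-, 4-, and 6-space runs, the way a code-aware tokenizer might.
TOY_VOCAB = {"  ", "    ", "      ", " ", '"key"', '"value"', ":", "{", "}", "\n"}


def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenizer; unknown text falls back to single characters."""
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


def indent_token(line, vocab):
    """Return the first token of a line (the indentation token if the line is indented)."""
    return greedy_tokenize(line, vocab)[0]


if __name__ == "__main__":
    for depth in (2, 4, 6):
        line = " " * depth + '"key": "value"'
        print(depth, greedy_tokenize(line, TOY_VOCAB))
```

In this toy setup, each nesting depth starts with its own distinct token, so a model could in principle read the nesting level of a JSON/YAML line from a single token instead of counting spaces.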