top | item 36312488

Text Editor Data Structures

367 points| starfreakclone | 2 years ago |cdacamar.github.io

78 comments

order

badsectoracula|2 years ago

There are tons of articles about plain text editor data structures, but what about rich text editor data structures? Let's say i want to implement a text editor that can have bold, italic, underline, etc text but also be able to do automatic word breaking, align paragraph text to left/middle/right, insert images and/or other objects, have floating images and/or other objects around which the other (non floating) text/images/objects wrap, etc.

What would be the data structures for that? I can only think of trying to replicate something like the HTML DOM but i have a feeling something like Write for Windows 3.1 used a simpler data structure.

KerrAvon|2 years ago

It's not actually that different or complicated if you're already doing proportional plaintext rendering -- you need to store style attributes for a range of text, and you need to support different heights per line.

The real complexity is rendering all of unicode properly, and supporting international fonts, bidi layout, vertical text, etc.

samwillis|2 years ago

I believe most do use a DOM like structure for block level elements. For inline styling however they tend to not have separate nested nodes for each style (bold, italic, underline), and are closer to the old HTML4 <font> tag with multiple properties for each style and don't nest them. It makes for much simpler code.

The docs for ProseMirror are a brilliant insight into how many of these editors are designed. ProseMirror maintains its own document model in parallel to the html DOM and rectifies one to the other as either changes.

For realtime collaborative rich text editors take a look at PeriText, they have a brilliant article explaining the CRDT structures used.

https://prosemirror.net

https://www.inkandswitch.com/peritext/

throwaway2037|2 years ago

I am pretty sure that Write for Win3.1 used the Rich Edit control, or an early, non-public pre-cursor.

Here are the Win32 docs: https://learn.microsoft.com/en-us/windows/win32/controls/ric...

The more I read about this control, the more I learn about its insane feature set! Microsoft continues to make significant improvements to a version that is only shipped with Microsoft Office -- not available from a barebones Win7/10/11 install. Read more here: https://devblogs.microsoft.com/math-in-office/using-richedit...

Rich Edit control also supports the Text Object Model, which is very powerful. Read more here: https://learn.microsoft.com/en-us/windows/win32/api/tom/nn-t...

eludwig|2 years ago

Many older word processors (the only ones I'm familiar with) used style "runs", which are simply an array of structs that contain startPosition, endPosition (or length) and then a bunch of style info in whatever way makes sense to them.

I worked on Ready,Set,Go! back in the day and also wrote my own styled editor for the Mac in the 90's.

One interesting thing about this is that you could use a lazy "adjustor" to remove duplicate or overlapping styles. No need to worry about it during typing or selecting. A low priority task could analyze the runs and fix them up as needed and no one was the wiser.

IMO, the hardest part about writing a word processor back then was font management. You had to construct width tables to be able to compose lines properly. This generally involved a bunch of poorly documented APIs and peering into opaque data structures.

thangalin|2 years ago

RichTeXtFX[1] has these abilities[2] along with demo code[3] that provides an entry point for sleuthing. My text editor, KeenWrite[4], uses RichTeXtFX, but takes a different approach due to its usage of inline variable references. Rather than edit the document in a WYSIWYG-like fashion, users edit the Markdown document as plain text then export as PDF by selecting a theme to apply for typesetting. This cleanly separates content from presentation. The "themes" video tutorial demonstrates how theme selection works.[5]

[1]: https://github.com/FXMisc/RichTextFX/tree/master/richtextfx/...

[2]: https://gluonhq.com/presenting-a-new-richtextarea-control/

[3]: https://github.com/gluonhq/rich-text-area/blob/master/sample...

[4]: https://github.com/DaveJarvis/keenwrite

[5]: https://www.youtube.com/watch?v=3QpX70O5S30&list=PLB-WIt1cZY...

convolvatron|2 years ago

its probably instructive to look at elisp text properties. not so much that its a great system - but it raises some messy questions about how to operate on substrings that have different sets of properties

kjs3|2 years ago

Perhaps look at how (Open,Libre)Office does it to support ODT format (e.g. OASIS Open Document Format)?

chaoz_|2 years ago

You can always check Qt TextEdit source code ;)

nsajko|2 years ago

This seems revisionist/ignorant. The article attributes a "piece tree" data structure to VS Code developers, however the must-read 1998 paper by Charles Crowley, Data Structures for Text Sequences, already mentions search trees as being used to enhance the naive piece table in some text editors.

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48...

starfreakclone|2 years ago

Thank you for the pointer! I was actually not aware of this paper at all, which is why it was not included here (not sure how I missed it). I'll be sure to push a revision to the blog which mentions this paper.

keithnz|2 years ago

I think you could probably get rid of the first sentence.

phildenhoff|2 years ago

Thanks to this article, I learned that the core data structure for VSCode's text is written in TypeScript[0]. I, uh, I knew VS Code was written in TypeScript but I didn't realise _all_ of it was. It's crazy to think that the editor works as well as it does!

[0]: https://github.com/microsoft/vscode/tree/main/src/vs/editor/...

cmrdporcupine|2 years ago

I used a red-black tree written in JS over 15 years ago at a job where I needed to use lower bound and range operations for nodes in a pivot-table type dashboard for telecoms metrics. I was actually blown away at the time at how it actually performed rather decently, and this is back in the pre-Chrome Firefox & IE6&IE7 days.

Now in the world of V8 and JavaScriptCore, I don't think it's crazy at all. So much work has gone into JS runtime optimization. For heavily concurrent and memory intensive workloads, I can imagine problems though.

celeritascelery|2 years ago

> if the underlying buffer needs to reallocate after some edits the entire process slows to a crawl as you need to allocate a larger buffer and copy each byte into the new buffer and destroy the old one. It can turn an O(n) operation into O(n^2) randomly.

This would still only be a O(n) operation. The constant value might be higher, but the complexity is the same.

I don’t buy the argument that gap buffers “are bad for multiple cursors”. I get the argument on a theoretical level, but real hardware is not theoretical. There are several operations that are theoretical faster with a hashmap or b-tree than a vector, like insert and delete. But in reality the vector is usually faster in the real world except for very large inputs[1]. Gap buffers are basically vectors.

Another point with multiple cursors and gap buffers is that Chris Wellons animations show the gap moving back to the first cursor every time it needed to add a new character. But in reality you would just walk the cursors in reverse order each time, saving a trip.

I have actually written and benchmarked a very naive and unoptimized gap buffer, and the results showed that it was faster than highly optimized rope implementations on all real world benchmarks[2], including multiple cursors.

That being said, a gap buffer is still probably not the best data structure for a text editor because it has worse “worse case” performance than something like a rope or piece-tree. Even though it is faster overall, it’s the tail latency's that really matter for interactive programs.

Overall I enjoyed reading the post, I find the topics fascinating and this was well presented.

[1] http://www.goodmath.org/blog/2009/02/18/gap-buffers-or-dont-...

[2] https://github.com/CeleritasCelery/rune/issues/17#issuecomme...

nsajko|2 years ago

These data structures are not mutually exclusive, see the paper by Crowley linked elsewhere in the discussion. You could use, e.g., a piece table with a gap buffer for descriptors, or a piece table with a tree for descriptors.

vbezhenar|2 years ago

Can someone point me to structures/algorithms to use for text editor which could support files of unlimited lengths (including lines of unlimited length) without loading those fully in the buffer?

I miss that editor so much, that I'm considering to write one some day, but I have no idea how to do so. I can invent things myself, but I guess those things were invented already back in the days computers were different.

modernerd|2 years ago

You might be interested in ewig and immer by Juan Pedro Bolivar Puente:

https://github.com/arximboldi/ewig

https://github.com/arximboldi/immer

See the author instantly opening a ~1GB text file with async loading, paging through, copying/pasting, and undoing/redoing in their prototype “ewig” text editor about 27 minutes into their talk here:

https://m.youtube.com/watch?v=sPhpelUfu8Q

It’s backed by a “vector of vectors” data structure called a relaxed radix balanced tree:

https://infoscience.epfl.ch/record/169879/files/RMTrees.pdf

That original paper has seen lots of attention and attempts at performance improvements, such as:

https://hypirion.com/musings/thesis

https://github.com/hyPiRion/c-rrb

celeritascelery|2 years ago

That is essentially what VLF[1] does in Emacs. It reads in discrete chunks of the file at a time and doesn’t load the next one till you try to display it. Doesn’t require any fancy data structures, just some extra book keeping and mechanics.

[1] https://github.com/m00natic/vlfi

hgs3|2 years ago

> Create debugging utilities for data structures very early on.

Yes, this isn't encouraged enough. I often serialize my data structures as either JSON or Graphviz DOT files for visualization. It helps save an immense amount of time. You can also use the generated files for regression testing, i.e. diff the actual output with the serialized output and if they're different, then a bug was introduced.

mannyv|2 years ago

Apple's MPW was Apple's programming environment for a long time.

It's been a while, but I believe that it represented everything in a tree of n-character chunks (n = 6?). It was probably the first editor that I used that could open files of pretty much any length.

eschneider|2 years ago

I do miss the 'select and execute', and 'everything is a shell' model of MPW. Definitely one of my favorite development environments.

Onavo|2 years ago

Another potential approach is to use a traditional rope as-is, and generate a series of reverse operational transforms everytime there is an edit. Run a "compacting operation" every n edits to keep the stack of ops from growing out of hand.

egonschiele|2 years ago

I've been wondering if there's a good algorithm to handle highlights in a piece of text. If I highlight some text, lets say the part between the => <= here:

"Idioteque" is a song => by the English rock band Radiohead <=, released on their fourth album, Kid A (2000).

If I want to then edit this text, is there an efficient algorithm for figuring out the start and end index of the highlight for the edited text?

kerkeslager|2 years ago

There's way too little detail given in this question for you to get a useful answer. How is the text stored? How are you highlighting if you don't have the indices?

bjoli|2 years ago

I would probably reach for an RRB tree which would, apart from being very very neat, give me free undo/redo.

kerkeslager|2 years ago

I feel like the rope was too quickly discarded here. The author stopped exploring ropes because they don't fit criteria 2--efficient undo/redo. There are a few problems with claiming that ropes aren't efficient for this usage:

1. The claim that the rope is inefficient for undo/redo is based on a singular example of a small edit on short strings. This isn't where ropes shine, admittedly, but they don't need to shine there, because when dealing with such small pieces of data, pretty much anything you do will be faster than the user can see. If using larger strings, the space allocated for nodes becomes background noise as the strings themselves dominate the size.

2. The chosen solution, the piece table, is more memory-efficient than the rope at first glance, but that's a surface-level efficiency. The eventually-chosen solution, a piece tree, is far less memory efficient. Sure, at first glance it is more memory-efficient, but this is at the expense of tree traversals, which in the VSCode article, are addressed with cache, which... uses more memory. In the author's implementation there's even more memory used because there's a requirement he didn't include in his list: he wants it all to be immutable. Nevermind that ropes were immutable from the start...

3. If you have a document which uses a lot (read, thousands) of very small edits, then the size of small strings might start to matter. So if you're going to optimize for this, optimize it. There are some fairly small optimizations that make the inefficiency concerns completely irrelevant. One is pointer packing: in a 64-bit system, pointers are 64 bits, but in practice, the vast majority of systems use 48 or fewer of those bits: as it turns out, there aren't many systems with more than 2^48 bytes = 256 terabytes of RAM. This means the leading 16 bits are 0s. Trivially, this means you can store strings of 7 8-bit characters in the pointer itself, using the first 8 bits to signal if it's a string or pointer (if they're all 0s, it's a pointer) and the length of the string. All the strings in the inefficiency example can fit in a 64 bit integer: "Hello", " ", and "world" are all fewer than 7 bytes, which means you're passing around 64-bit integers, by value, with no allocations necessary. In fact, this means you can either append the " " to "Hello" like "Hello ", or prepend it to "world" like " world", and still stay under 7 bytes in either case. Remember, this is now 64 bit integers being passed around by value: this is far faster than piece trees.

4. The author treats undo/redo as a stack, but all the cool editors treat it as a tree. If you make a change, then undo it, then make another change, then undo the second change, is your first change lost? In vim/emacs, the answer is no: you can go back into a tree and find it to reapply. This means that all text is not only immutable but immortal: it has to be kept for the duration of the editing session. This enables a few more optimizations: we no longer need reference counts or garbage collection since we aren't reclaiming the memory, and now we can point into existing strings since they'll never change. Consider the following string: "The quick brown box jumps over the lazy dog." You may have noticed a typo: "box" should be "fox". This change requires 0 buffer allocations: we have a pointer to "The quick brown box jumps over the lazy dog." for the original string, a pointer to the same spot for "The quick brown " (with a length), a second pointer to "ox jumps over the lazy dog.", and a packed pointer (integer) for the string "f". This is pretty key because if you're not freeing any of this memory, you need to make sure you don't allocate more than necessary!

NOTE: I'm not saying that the rope is the better structure here. There may be more requirements which weren't captured in the article which mean that piece buffers really are the right answer. All I'm saying is that the article doesn't really explore ropes deeply enough to write them off so quickly.

tonetheman|2 years ago

Why is starfreakclone green? The username I mean.