Storing Text With A Gap Buffer

Writing a code editor control is more difficult than you might imagine for many reasons. Not only do editors need to look nice and be extensible but importantly they need to be fast. Nobody is going to enjoy writing / coding if there's a noticeable delay when typing or manipulating text. One of the fundamental components of a code editor that impacts performance is how it stores text internally.

There are many ways an editor control could choose to store text. The most naive approach would be to just keep the contents as a String, appending to it as you type and using Left(), Right() and Middle() to extract sub parts needed (e.g. when removing or replacing contiguous characters). Trust me when I say that this doesn't work well once you get more than a few hundred characters.

Another approach to consider is that used by XUICodeEditor. XUICodeEditor uses an array of Line classes that store only the characters on that line in the form of a String array. Conceptually this is pretty easy to understand and feels like a sensible swing at the problem. This solution is reasonably fast since there are rarely more than a hundred or so characters on a line. A significant drawback is that it's slow to get all the text in the editor as we have to iterate each line and concatenate. This makes it challenging to parse a document without a lexer that can manage only being able to see the current line (and perhaps the line before and afterwards).

A third, better, but not not ideal choice would be to store the entire document as an array of characters. This makes adding and removing characters at specific indices quick and easy. It's also straightforward enough to append characters. There are challenges involved in extracting arbitrary runs of characters from within the array since you have no option with Xojo other than to loop over the array and extract the characters. A major downside with this approach is that appending single characters either within the middle of the array or even at the the end with Array.Add requires Xojo to dynamically resize the array for each character added. Once the document has a few hundred characters, you'll start to notice the time it takes for this resizing to occur.

OK smart arse, what's the solution then? Well there are many but in my opinion the best approach is to use a gap buffer.

Gap buffer to the rescue

A gap buffer is a data structure that leverages the fact that changes to the contents of a document tend to occur around an area of interest. This makes sense if we consider the caret as the area of interest. Editing a document usually involves bursts of typing to insert characters immediately after the caret.

The principle of a gap buffer is that, within the data structure, you have the text stored as two contiguous areas of text separated by a gap. It's within this gap that new text can be inserted efficiently.

Consider a normal array where each element is a character:

|0|1|2|3|4|5|6|7|
|h|e|l|l|o|y|o|u|

To add a hyphen between "hello" and "you" we need to grow the array since there are only 8 elements and we need 9:

|0|1|2|3|4|5|6|7|8|
|h|e|l|l|o|-|y|o|u|

In Xojo, the framework would create a new 9 element array and copy all the characters of the old array to their new positions. You don't see this as it's all handled for you by Array.InsertAt but that is what's happening behind the curtain.

A gap buffer on the other hand would look something like this:

|0|1|2|3|4|5|6|7|8|9|a|b|c|
|h|e|l|l|o| | | | | |y|o|u|
          ^

Where ^ is the start of the gap.

The buffer's underlying data storage (which could be an array or a MemoryBlock) has two contiguous areas of text ("hello" and "you") but there is a gap of unused elements between them. As in the earlier case, to insert a hyphen between "hello" and "you" we do this:

|0|1|2|3|4|5|6|7|8|9|a|b|c|
|h|e|l|l|o|-| | | | |y|o|u|
            ^

Notice how we haven't needed to grow our backing store at all. This means we haven't had to copy / move anything. All that was required was a write to an index and adjusting the gapStart property. So long as the user keeps typing within the gap the operation will be fast.

Once the gap gets filled up then we need to create a new backing store and copy the data over. We'll ensure that the new data store has another gap and the process repeats. To edit characters anywhere in the document we just need to move the gap. Moving the gap does have a performance penalty as we need to grow the backing store but over the course of all operations, that cost is amortised. The end result is a text storage solution that is fast enough for a text editor.

Implementation

Obviously I didn't invent the idea of a gap buffer. My code is inspired by code from Alex Restrepo's CustomEditField as well as several other open source implementations in other programming languages.

You can find the implementation in the SyntaxArea GitHub repository. As of this post, the latest commit was 1a9903e.

We need two classes and a class interface for our implementation. Strictly speaking you could ditch the class interface (ITextStorage) but since this class will be used in my SyntaxArea project, I'm going to future proof things in case I find an even more efficient way to store text (such as a rope) in the future. By incorporating an interface I could in theory swap the gap buffer out for something better down the road.

ITextStorage

First up, the interface. These are the methods that our GapBuffer class will implement. As you can see, we have methods for inserting, removing, replacing and retrieving text at specified indexes over arbitrary lengths.

GapBufferData

Our GapBuffer class internally will use a MemoryBlock to store the actual characters of text. I've wrapped that up in a class that provides some useful methods. In theory, we could update the GapBuffer class to use a different data store if we figure out a better one.

The class has just one property, mStorage which is a MemoryBlock in which we will store each character in UTF-32 format. Why UTF-32? Well in this encoding each character is fixed at 4 bytes. An encoding like UTF-8 uses a variable number of bytes for each character (between 1 and 4). Whilst UTF-32 will use more memory, it makes indexing into the MemoryBlock to retrieve and set a character much easier. It sacrifices more memory but on a modern computer that is not an issue.

The constructor is basic - we simply pass it the size (number of characters to store) and we multiply that by a class constant BYTES_PER_CHAR (4) to get the correct number of bytes the MemoryBlock needs to be.

The class has a computed property, Size which retrieves the maximum number of characters that the class can contain. It is not the number of characters in the store. In the code you'll note how we need to multiply or divide by the constant BYTES_PER_CHAR since we use 4 bytes for every character stored.

There's a Copy() method that copies a portion (or all) of one GapBufferData instance into another. Lastly we have two overloaded methods, StringValue(). These either retrieve or set a run of characters.

GapBuffer

The meat of the action resides in the GapBuffer class. This implements the ITextStorage interface. The code comprehensively documented so should be easy to understand. There's not a lot in the methods - that's part of the beauty of gap buffers (and the secret to their great performance) - they are fairly simple data structures.

Subscribe to garrypettet.com

Sign up now and never miss a new post. It's free.
Jamie Larson
Subscribe