[dev] [RFC] Design of a vim like text editor

From: Marc André Tanner <mat_AT_brain-dump.org>
Date: Sat, 13 Sep 2014 16:01:15 +0200

TLDR: I'm writing an experimental but (hopefully) highly efficient vim
like text editor based on a piece chain data structure. You will find
an url to a git repository at the end of this rather long mail.

Help welcome!

Why another text editor?
========================

It all started when I was recently reading the excellent Project Oberon[0],
where in chapter 5 a data structure for managing text is introduced.
I found this rather appealing and wanted to see how it works in practice.

After some time I decided that besides just having fun hacking around I
might as well build something which could (at least in the long run)
replace my current editor of choice: vim.

This should be accomplished by a reasonable amount of clean (your mileage
may vary), modern and legacy free C code. Certainly not an old, 500'000
lines[1] long, #ifdef cluttered mess which tries to run on all broken
systems ever envisioned by mankind.

Admittedly vim has a lot of functionally, most of which I don't use. I
therefore set out with the following main goals:

 - Unicode aware

 - binary clean

 - handle arbitrary files (this includes large ones, think >100M SQL-dumps)

 - unlimited undo/redo support

 - syntax highlighting

 - regex search (and replace)

 - multiple file/window support

 - extensible and configurable through familiar config.def.h mechanism

The goal could thus be summarized as "80% of vim's features (in other
words the useful ones) implemented in roughly 1% of the code".

Finally and most importantly it is fun! Writing a text editor presents
some interesting challenges and design decisions, some of which are
explained below.

Text management using a piece table/chain
=========================================

The core of this editor is a persistent data structure called a piece
table which supports all modifications in O(m), where m is the number
of non-consecutive editing operations. This bound could be further
improved to O(log m) by use of a balanced search tree, however the
additional complexity doesn't seem to be worth it, for now.

The actual data is stored in buffers which are strictly append only.
There are two types of buffers, a fixed-sized for the original file
content and append-only ones one for all modifications.

A text, i.e. a sequence of bytes, is represented as a double linked
list of pieces each with a pointer into a buffer and an associated
length. Pieces are never deleted but instead always kept around for
redo/undo support. A span is a range of pieces, consisting of a start
and end piece. Changes to the text are always performed by swapping
out an existing, possibly empty, span with a new one.

An empty document is represented by two special sentinel pieces which
always exist:

    /-+ --> +-\
    | | | |
    \-+ <-- +-/
     #1 #2

Loading a file from disk is as simple as mmap(2)-ing it into a buffer,
creating a corresponding piece and adding it to the double linked list.
Hence loading a file is a constant time operation i.e. independent of
the actual file size (assuming the operating system uses demand paging).

    /-+ --> +-----------------+ --> +-\
    | | | I am an editor! | | |
    \-+ <-- +-----------------+ <-- +-/
     #1 #3 #2

Insert
------

Inserting a junk of data amounts to appending the new content to a
modification buffer. Followed by the creation of new pieces. An insertion
in the middle of an existing piece requires the creation of 3 new pieces.
Two of them hold references to the text before respectively after the
insertion point. While the third one points to the newly added text.

    /-+ --> +---------------+ --> +----------------+ --> +--+ --> +-\
    | | | I am an editor| |which sucks less| |! | | |
    \-+ <-- +---------------+ <-- +----------------+ <-- +--+ <-- +-/
     #1 #4 #5 #6 #2

           modification buffer content: "which sucks less"

During this insertion operation the old span [3,3] has been replaced
by the new span [4,6]. Notice that the pieces in the old span were not
changed, therefore still point to their predecessors/successors, and can
thus be swapped back in.

If the insertion point happens to be at a piece boundary, the old span
is empty, and the new span only consists of the newly allocated piece.

Delete
------

Similarly a delete operation splits the pieces at appropriate places.

    /-+ --> +-----+ --> +--+ --> +-\
    | | | I am| |! | | |
    \-+ <-- +-----+ <-- +--+ <-- +-/
     #1 #7 #6 #2

Where the old span [4,5] got replaced by the new span [7,7]. The underlying
buffers remain unchanged.

Cache
-----

Notice that the common case of appending text to a given piece is fast
since, the new data is simply appended to the buffer and the piece length
is increased accordingly. In order to keep the number of pieces down,
the least recently edited piece is cached and changes to it are done
in place (this is the only time buffers are modified in a non-append
only way). As a consequence they can not be undone.

Undo/redo
---------

Since the buffers are append only and the spans/pieces are never destroyed
undo/redo functionality is implemented by swapping the required spans/pieces
back in.

As illustrated above, each change to the text is recorded by an old and
a new span. An action consists of multiple changes which logically belong
to each other and should thus also be reverted together. For example
a search and replace operation is one action with possibly many changes
all over the text. Actions are kept in an undo respectively redo stack.

A state of the text can be marked by means of a snapshotting operation.
The undo/redo functionality operates on such marked states and switches
back and forth between them.

The history is currently linear, no undo / history tree is implemented.

Properties
----------

The main advantage of the piece chain as described above is that all
operations are performed independent of the file size but instead linear
in the number of pieces i.e. editing operations. The original file buffer
never changes which means the mmap(2) can be performed read only which
makes optimal use of the operating system's virtual memory / paging system.

The maximum editable file size is limited by the amount of memory a process
is allowed to map into its virtual address space, this shouldn't be a problem
in practice. The whole process assumes that the file can be used as is.
In particular the editor assumes all input and the file itself is encoded
as UTF-8. Supporting other encodings would require conversion using iconv(3)
or similar upon loading and saving the document, which defeats the whole
purpose.

Similarly the editor has to cope with the fact that lines can be terminated
either by \n or \n\r. There is no conversion to a line based structure in
place. Instead the whole text is exposed as a sequence of bytes. All
addressing happens by means of zero based byte offsets from the start of
the file.
  
The main disadvantage of the piece chain data structure is that the text
is not stored contiguous in memory which makes seeking around somewhat
harder. This also implies that standard library calls like regex(3)
functions can not be used as is. However this is the case for all but
the most simple data structures used in text editors.

Screen Drawing
==============

The current code takes a rather simple approach to screen drawing. It
basically only remembers the starting position of the area being shown.
Then fetches a "screen full" of bytes and outputs one character at a
time until the end of the window is reached. A consequence of this
approach is that lines are always wrapped and horizontal scrolling is
not supported.

No efforts are made to reduce the terminal output. This task is delegated
to the underlying curses library which already performs a kind of double
buffering. The window is always redrawn completely even if only a single
character changes. It turns out this is actually necessary if one wants
to support multiline syntax highlighting.

While outputting the individual characters a cell matrix is populated
where each entry stores the length in bytes of the character displayed
at this particular cell. For characters spanning multiple columns the
length is always stored in the leftmost cell. As an example a tab has a
length of 1 byte followed by up to 7 cells with a length of zero.
Similarly a \n\r line ending occupies only one screen cell but has a
length of 2.

This matrix is actually stored per line inside a double linked list of
structures representing screen lines. For each line we keep track of
the length in bytes of the underlying text, the display width of all
characters part of the line, and the logical line number.

All cursor positioning is always performed in bytes from the start of
the file and works by traversing the double linked list of screen lines
until the correct line is found. Then the cell array is consulted to
move to the correct column.

Syntax-Highlighting
-------------------

The editor takes a similar regex-based approach to syntax highlighting
than sandy and reuses its syntax definitions but always applies them to
a "screen full" of text thus enabling multiline coloring.

Currently only the highlighting rules for C have been imported from sandy.

Window-Management
-----------------

It is possible to open multiple windows via the :split/:vsplit/:open
commands or by passing multiple files on the command line.

In principle it would be nice to follow a similar client/server approach
as sam/samterm i.e. having the main editor as a server and each window
as a separate client process with communication over a unix domain socket.

That way window management would be taken care of by dwm or dvtm and the
different client processes would still share common cut/paste registers
etc.

However at the moment I don't want to open that can of worms and instead
settled for a single process architecture.

Search and replace
------------------

This is one of the last big conceptual problems.

Currently the editor copies the whole text to a contiguous memory block
and then uses the standard regex functions from libc. Clearly this is not
a satisfactory solution for large files and kind of defeats the whole
effort spent on the piece table.

The long term solution is to write our own regular expression engine or
modify an existing one to make use of the iterator API. This would allow
efficient search without having to double memory consumption. At some
point I will have to (re)read the papers of Russ Cox[2] and Rob Pike
about this topic.

Command-Prompt
--------------

The editor needs some form of command prompt to get user input
(think :, /, ? in vim).

At first I wanted to implement this in terms of an external process,
similar to the way it is done in sandy with communication back to the
editor via a named pipe.

At some point I thought it would be possible to provide all editor commands
as shell scripts in a given directory, then set $PATH accordingly and run
the shell. This would give us readline editing, tab completion, history and
Unicode support for free. But unfortunately it won't work due to quoting
issues and other conflicts of special symbols with different meanings.

Later it occurred to me that the editor prompt could just be treated as
special 1 line file. That is all the main editor functionality is reused
with a slightly different set of key bindings.

This approach also has the added benefit of further testing the main editor
component (in particular corner cases like editing at the end of the file).

Editor Frontends
================

The editor core is written in a library like fashion which should make
it possible to write multiple frontends with possibly different user
interfaces/paradigms.

At the moment there exists a barely functional, non-modal nano/sandy
like interface which was used during early testing. The default interface
is a vim clone called vis.

The frontend to run is selected based on the executable name.

Key binding modes
-----------------

The available key bindings for the different modes are arranged in a
hierarchical way in config.h (there is also an ascii tree giving an
overview in that file). Each mode can specify a parent mode which is
used to look up a key binding if it is not found in the current mode.
This reduces redundancy for keys which have the same meaning in
different modes.

Each mode can also provide hooks which are executed upon entering/leaving
the mode and when there was an unmatched key.

vis a vim like frontend
-----------------------

The vis frontend uses a similar approach to the one suggested by Markus
Teich[3] but it turns out to be a bit more complicated. For starters
there are movements and commands which consist of more than one key/
character. As a consequence the key lookup is not a simple array
dereference but instead the arrays are looped over until a match
is found.

The following section gives a quick overview over various vim features
and their current support in vis.

 Operators
 ---------
   working: d (delete), c (change), y (yank), p (put)
   planned: > (shift-right), < (shift-left)
            those depend on handling of indention tabs <-> spaces

 Movements
 ---------
   h (char left)
   l (char right)
   j (line down)
   k (line up)
   0 (start of line)
   ^ (first non-blank of line)
   g_ (last non-blank of line)
   $ (end of line)
   % (match bracket)
   b (previous start of a word)
   w (next start of a word)
   e (next end of a word)
   ge (previous end of a word)
   { (previous paragraph)
   } (next paragraph)
   ( (previous sentence)
   ) (next sentence)
   gg (begin of file)
   G (goto line or end of file)
   | (goto column)
   n (repeat last search forward)
   N (repeat last search backwards)
   f{char} (to next occurrence of char to the right)
   t{char} (till before next occurrence of char to the right)
   F{char} (to next occurrence of char to the left)
   T{char} (till before next occurrence of char to the left)
   /{text} (to next match of text in forward direction)
   ?{text} (to next match of text in backward direction)

  There is currently no distinction between what vim calls a WORD and
  a word, only the former is implemented. Though infrastructure for
  the latter also exists.

  The semantics of a paragraph and a sentence is also not always 100%
  the same as in vim.

  Some of these commands do not work as in vim when prefixed with a
  digit i.e. a multiplier. As an example 3$ should move to the end
  of the 3rd line down. The way it currently behaves is that the first
  movement places the cursor at the end of the current line and the last
  two have thus no effect.

  In general there are still a lot of improvements to be made in the
  case movements are forced to be line or character wise. Also some of
  them should be inclusive in some context and exclusive in others.
  At the moment they always behave the same.

 Text objects
 ------------

  All of the following text objects are implemented in an inner variant
  (prefixed with 'i') and a normal variant (prefixed with 'a'):

   w word
   s sentence
   p paragraph
   [,], (,), {,}, <,>, ", ', ` block enclosed by these symbols

  For word, sentence and paragraph there is no difference between the
  inner and normal variants.

 Modes
 -----

  At the moment there exists a more or less functional insert, replace
  and character wise visual mode.

  A line wise visual mode is planned.

 Marks
 -----

  Only the 26 lower case marks [a-z] are supported. No marks across files
  are supported. Marks are not preserved over editing sessions.

 Registers
 ---------

  Only the 26 lower case registers [a-z] and 1 additional default register
  is supported.

 Undo/Redo and Repeat
 --------------------
 
  The text is currently snapshoted whenever an operator is completed as
  well as when insert or replace mode is left. Additionally a snapshot
  is also taken if in insert or replace mode a certain idle time elapses.

  Another idea is to snapshot based on the distance between two consecutive
  editing operations (as they are likely unrelated and thus should be
  individually reversible).

  The repeat command '.' currently only works for operators. This for
  example means that inserting text can not be repeated (i.e. inserted
  again). The same restriction also applies to commands which are not
  implemented in terms of operators, such as 'o', 'O', 'J' etc.

 Command line prompt
 -------------------

  At the ':'-command prompt only the following commands are recognized:

   :nnn go to line nnn
   :edit replace current file with a new one or reload it from disk
   :open open a new window
   :qall close all windows, exit editor
   :quit close currently focused window
   :read insert content of another file at current cursor position
   :split split window horizontally
   :vsplit split window vertically
   :wq write changes then close window
   :write write current buffer content to file

  The substitute command is recognized but not yet implemented. The '!'
  command to filter text through an external program is also planned.
  At some point the range syntax should probably also be supported.

  History support, tab completion and wildcard expansion are other
  worthwhile features.

 Tab <-> Space
 -------------

  Currently there is no expand tab functionality i.e. they are always
  inserted as is. For me personally this is no problem at all. Tabs
  should be used for indention! That way everybody can configure their
  preferred tab width whereas spaces should only be used for alignment.

 Jump list and change list
 -------------------------

  Neither the jump list nor the change lists are currently supported.

 Mouse support
 -------------

  The mouse is currently not used at all.
  
 Other features
 --------------

  Other things I would like to add in the long term are:

   + code completion: this should be done as an external process. I will
     have to take a look at the tools from the llvm / clang project. Maybe
     dvtm's terminal emulation support could be reused to display an
     slmenu inside the editor at the cursor position?

   + something similar to vim's quick fix functionality

  Stuff which vim does which I don't use and have no plans to add:

   - GUIs (neither x11, motif, gtk, win32 ...)
   - text folding
   - visual block mode
   - plugins (certainly not vimscript, if anything it should be lua based)
   - runtime key bindings
   - right-to-left text
   - tabs (as in multiple workspaces)
   - ex mode
   - macro recording
 
How to help?
------------

At this point it might be best to fetch the code, edit some scratch file,
notice an odd behavior or missing functionality, write and submit a patch
for it, then iterate.

WARNING: There are probably still some bugs left which could corrupt your
         unsaved changes. Use at your own risk. At this point I suggest to
         only edit non-critical files which are under version control and
         thus easily recoverable!

                    git clone git://repo.or.cz/vis.git
 
A quick overview over the code structure to get you started:

 config.def.h definition of key bindings, commands, syntax highlighting etc.
 vis.c vi(m) specific editor frontend, program entry point
 editor.[ch] screen / window / statusbar / command prompt management
 window.[ch] window drawing / syntax highlighting / cursor placement
 text-motions.[ch] movement functions take a file position and return a new one
 text-objects.[ch] functions take a file position and return a file range
 text.[ch] low level text / marks / {un,re}do / piece table implementation
 
Hope this gets the interested people started. Feel free to ask questions
if something is unclear! There are still a lot of bugs left to fix, but
by now I'm fairly sure that the general concept should work.

As always, comments and patches welcome!

Cheers,
Marc

[0] http://www.inf.ethz.ch/personal/wirth/ProjectOberon/
[1] https://www.openhub.net/p/vim
[2] http://swtch.com/~rsc/regexp/
[3] http://lists.suckless.org/dev/1408/23219.html
-- 
 Marc André Tanner >< http://www.brain-dump.org/ >< GPG key: CF7D56C0
Received on Sat Sep 13 2014 - 16:01:15 CEST

This archive was generated by hypermail 2.3.0 : Sat Sep 13 2014 - 16:12:07 CEST