This patch contains several changes:
- It changes locking on input to improve performance.
- It fixes the bug where the vt_flush() timer is resumed even when the change is made to another (hidden) window's output buffer.
- It fixes the VGA palette and color indexing.
The main part of the patch is the locking change. I compared syscons(4) and vt(4) input performance using the following command, from a remote SSH session, where both syscons(4) and vt(4) are configured to use text mode:
time cat find.txt >/dev/ttyv0
find.txt was created with:
find / > find.txt
In my benchmark, the file is about 26 MiB and 360000 lines.
Results:
- syscons(4): 700 ms
- vt(4): 1500 ms
In both cases, whether or not the window is the currently displayed one doesn't affect the time taken.
Another difference is the rendering: while the content is scrolling on the screen, frames look good with syscons(4). With vt(4), however, they look "split", as if a single line contained several letters, then spaces, then several other letters. Once the file is entirely written to the terminal, the content is correct for both implementations.
After studying the code for both implementations, it looks like:
- syscons(4) acquires a single lock, does the whole input process (writes the character to the buffer, moves the cursor, possibly handles the newlines and line wraps by scrolling the content) and releases the lock. The rendering thread uses the same lock to render the new content and that lock is held for the entire drawing process.
- With vt(4), a lock is acquired by the upper layer, then vt(4) does the input process: it writes the character to the buffer, acquires and releases a lock to mark the region as out-of-date, moves the cursor, re-acquires the lock to mark the region as out-of-date, and possibly handles the newlines/line wraps, where it acquires that lock again twice (copy screen, then fill the new line). Once vt(4) is finished, the upper layer releases its own lock. In addition, after each putchar/cursor move/copy/fill, there is an atomic cmpset to decide if the vt_flush() timer should be resumed. Last but not least, the upper layer acquires its lock a second time to call the tc_done() callback. The rendering thread acquires the vt(4) lock to read and reset the coordinates of the out-of-date region, releases it and does the drawing.
So in the end:
- syscons(4) acquires 1 lock to process an input.
- vt(4) performs 4 lock acquisitions and 2 atomic cmpsets to process an input, or 5 lock acquisitions and 3 atomic cmpsets if there is a newline.
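To make the counting concrete, here is a minimal userspace sketch of the vt(4) path described above for one character without a newline. It is only a model: the pthread mutexes stand in for the kernel locks, and every name in it (counted_lock(), mark_dirty_old(), process_input_old(), the counters) is hypothetical and does not come from the vt(4) sources.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical counters, for illustration only. */
static unsigned lock_count;
static unsigned cmpset_count;

/* Stand-ins for the upper layer lock and the vt(4) internal lock. */
static pthread_mutex_t upper_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t vt_lock = PTHREAD_MUTEX_INITIALIZER;

/* 1 while the flush timer is suspended, 0 once it has been resumed. */
static atomic_int timer_suspended = 1;

static void
counted_lock(pthread_mutex_t *m)
{
	pthread_mutex_lock(m);
	lock_count++;
}

/* Mark a region out-of-date, then try to resume the flush timer. */
static void
mark_dirty_old(void)
{
	int expected = 1;

	counted_lock(&vt_lock);
	/* ... update the out-of-date region coordinates ... */
	pthread_mutex_unlock(&vt_lock);

	atomic_compare_exchange_strong(&timer_suspended, &expected, 0);
	cmpset_count++;
}

/* One input character, no newline. */
static void
process_input_old(void)
{
	counted_lock(&upper_lock);
	mark_dirty_old();		/* putchar */
	mark_dirty_old();		/* cursor move */
	pthread_mutex_unlock(&upper_lock);

	counted_lock(&upper_lock);
	/* ... tc_done() callback ... */
	pthread_mutex_unlock(&upper_lock);
}
```

Calling process_input_old() once leaves lock_count at 4 and cmpset_count at 2, which matches the numbers above for an input without a newline.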
The patch improves this situation by:
- Introducing new tf_video_lock() and tf_video_unlock() callbacks: they allow vt(4) to acquire its internal lock once instead of each time one of the other callbacks is called.
- Moving the atomic cmpset to tf_video_unlock() so it's done only once.
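The idea behind the two new callbacks can be sketched in userspace as follows. This is not the patch itself: struct model_vt and the model_*() functions are hypothetical names, a pthread mutex stands in for the vt(4) internal lock, and only that internal lock is modelled (the upper layer lock is left out).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names come from vt(4). */
struct model_vt {
	pthread_mutex_t	lock;		/* stands in for the vt(4) lock */
	bool		locked;		/* true while the lock is held */
	bool		dirty;		/* some region is out-of-date */
	atomic_int	timer_suspended; /* 1 = flush timer suspended */
	unsigned	lock_count;
	unsigned	cmpset_count;
};

static void
model_vt_init(struct model_vt *vt)
{
	pthread_mutex_init(&vt->lock, NULL);
	vt->locked = false;
	vt->dirty = false;
	atomic_init(&vt->timer_suspended, 1);
	vt->lock_count = 0;
	vt->cmpset_count = 0;
}

/* Plays the role of tf_video_lock(): take the lock once per input. */
static void
model_video_lock(struct model_vt *vt)
{
	pthread_mutex_lock(&vt->lock);
	vt->locked = true;
	vt->lock_count++;
}

/* The other callbacks no longer lock; they expect the lock to be held. */
static void
model_putchar(struct model_vt *vt)
{
	assert(vt->locked);
	vt->dirty = true;
}

static void
model_cursor(struct model_vt *vt)
{
	assert(vt->locked);
	vt->dirty = true;
}

/* Plays the role of tf_video_unlock(): unlock, then a single cmpset. */
static void
model_video_unlock(struct model_vt *vt)
{
	int expected = 1;

	vt->locked = false;
	pthread_mutex_unlock(&vt->lock);

	atomic_compare_exchange_strong(&vt->timer_suspended, &expected, 0);
	vt->cmpset_count++;
}
```

In this model, a model_video_lock() / model_putchar() / model_cursor() / model_video_unlock() sequence costs one acquisition of the internal lock and one cmpset, no matter how many callbacks run in between.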
After that, vt(4) is down to around 760 ms for the same test. It is still slightly slower than syscons(4), mainly because there are two locks, not one, but it's already a 2x improvement on my laptop.
For the rendering thread in vt(4), the lock is acquired earlier and released later, in particular after the drawing is finished. This fixes the weird frame rendering while scrolling because vt(4) won't write to the output buffer while the rendering thread is reading it.
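A small userspace sketch of that rendering change, under the assumption that the out-of-date region is a simple range of columns on one line. struct model_term, model_write() and model_flush() are hypothetical names, and a memcpy() stands in for the real drawing.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MODEL_COLS	8

/* Hypothetical one-line terminal; not the vt(4) data structures. */
struct model_term {
	pthread_mutex_t	lock;
	char		buf[MODEL_COLS];	/* output buffer */
	char		screen[MODEL_COLS];	/* what is displayed */
	int		dirty_begin;		/* out-of-date [begin, end) */
	int		dirty_end;
};

static void
model_term_init(struct model_term *t)
{
	pthread_mutex_init(&t->lock, NULL);
	memset(t->buf, ' ', sizeof(t->buf));
	memset(t->screen, ' ', sizeof(t->screen));
	t->dirty_begin = t->dirty_end = 0;
}

/* Input path: write to the buffer and grow the out-of-date region. */
static void
model_write(struct model_term *t, int col, const char *s)
{
	int len = (int)strlen(s);

	assert(col >= 0 && col + len <= MODEL_COLS);
	pthread_mutex_lock(&t->lock);
	memcpy(&t->buf[col], s, (size_t)len);
	if (t->dirty_begin == t->dirty_end) {
		t->dirty_begin = col;
		t->dirty_end = col + len;
	} else {
		if (col < t->dirty_begin)
			t->dirty_begin = col;
		if (col + len > t->dirty_end)
			t->dirty_end = col + len;
	}
	pthread_mutex_unlock(&t->lock);
}

/* Rendering path: the lock is held until the drawing is finished. */
static void
model_flush(struct model_term *t)
{
	int begin, end;

	pthread_mutex_lock(&t->lock);
	begin = t->dirty_begin;
	end = t->dirty_end;
	t->dirty_begin = t->dirty_end = 0;
	/* "Draw" while still holding the lock, so writers must wait. */
	memcpy(&t->screen[begin], &t->buf[begin], (size_t)(end - begin));
	pthread_mutex_unlock(&t->lock);
}
```

Because model_write() and model_flush() take the same lock and the copy happens before the unlock, a writer cannot change buf in the middle of a frame, which is the property the paragraph above relies on.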
I also ran several buildkernels to see if the locking impacted a real use case. I couldn't find any difference: regardless of the console driver, and with or without this patch, the buildkernel time was always the same.
Now the smaller change: when input is for a terminal which is not currently displayed, we don't try to resume the vt_flush() timer anymore. This was a waste of time and resources.
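The check amounts to something like the sketch below. The structures and model_resume_flush_timer() are hypothetical and only illustrate the idea of comparing the target window with the displayed one before touching the timer state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a window and its device. */
struct model_window {
	int	id;
};

struct model_device {
	struct model_window	*cur_window;	/* window being displayed */
	atomic_int		timer_suspended; /* 1 = timer suspended */
};

/* Returns true only if this call resumed the flush timer. */
static bool
model_resume_flush_timer(struct model_device *vd,
    const struct model_window *vw)
{
	int expected = 1;

	/* Input for a hidden window: nothing to redraw, skip the cmpset. */
	if (vd->cur_window != vw)
		return (false);

	return (atomic_compare_exchange_strong(&vd->timer_suspended,
	    &expected, 0));
}
```

With a suspended timer, a call for a hidden window returns false and leaves timer_suspended untouched, while a call for the displayed window resumes the timer once.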
In the end, I will commit those changes in several commits, not everything at once.