Page MenuHomeFreeBSD

vt: Improve input performances + various small fixes
ClosedPublic

Authored by dumbbell on May 4 2018, 9:26 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 17, 9:52 PM
Unknown Object (File)
Sun, Jan 12, 11:13 PM
Unknown Object (File)
Sun, Jan 12, 11:13 PM
Unknown Object (File)
Sun, Jan 12, 11:13 PM
Unknown Object (File)
Sun, Jan 12, 11:13 PM
Unknown Object (File)
Sun, Jan 12, 10:11 PM
Unknown Object (File)
Sun, Jan 5, 4:37 AM
Unknown Object (File)
Wed, Jan 1, 5:49 AM
Subscribers

Details

Summary

This patch contains several changes:

  1. It changes locking on input to improve performances.
  2. It fixes the bug where the vt_flush() timer is resumed even when the change is made to another (hidden) window's output buffer.
  3. It fixes the VGA palette and color indexing.

The main part of the patch is the locking change. I compared syscons(4) and vt(4) input performances using the following command, from a remote SSH session, where both syscons(4) and vt(4) are configured to use text mode:

time cat find.txt >/dev/ttyv0

find.txt was created with:

find / > find.txt

In my benchmark, the file is about 26 MiB and 360000 lines.

Results:

  • syscons(4): 700 ms
  • vt(4): 1500 ms

In both cases, the fact that the window is the currently showed window or not doesn't impact the time taken.

Another difference is the rendering: while the content was scrolling on the screen,with syscons(4), frames look good. However with vt(4), they look "split" as if on a single line, there was several letters, then spaces, then several other letters. Once the file is entirely written to the terminal, the content is correct for both implementations.

After studying the code for both implementations, it looks like that:

  • syscons(4) acquires a single lock, does the whole input process (writes the character to the buffer, moves the cursor, possibly handles the newlines and line wraps by scrolling the content) and release the lock. The rendering thread uses the same lock to render the new content and that lock is held for the entire drawing process.
  • With vt(4), a lock is acquires by the upper layer, then vt(4) does the input process: it writes the character to the buffer, acquires and releases a lock to mark the region as out-of-date, moves the cursor, re-acquires the lock to mark the region as out-of-date, possible handles the newlines/line wraps where it acquires that lock again twice (copy screen, then fill the new line). Once vt(4) is finished, the upper layer releases its own lock. In addition, after each putchar/cursor move/copy/fill, there is an atomic cmpset to decide if the vt_flush() timer should be resumed. Last but not least, the upper layer re-acquires its lock a second time to call the tc_done() callback. The rendering thread acquires the vt(4) lock to read and reset the coordinates of the out-of-date region, releases it and does the drawing.

So in the end:

  • syscons(4) acquires 1 lock to process an input.
  • vt(4) acquires 4 locks and 2 atomic cmpsets to process an input, or 5 locks and 3 atomic cmpset if there is a newline.

The patch improves this situation by:

  • Introducing new tf_video_lock() and tf_video_unlock() callbacks: they allow vt(4) to acquire its internal lock once instead of each time one of the other callbacks is called.
  • Moving the atomic cmpset fo tf_video_unlock() so it's done once only.

After that, vt(4) is down to around 760 ms for the same test. So slightly slower than syscons(4), mainly because there are two locks, not one. It's already a 2 times improvement on my laptop.

For the rendering thread in vt(4), the lock is acquired earlier and released later, in particular after the drawing is finished. This fixes the weird frame rendering while scrolling because vt(4) won't write to the output buffer while the rendering thread is reading it.

I also ran several builkernels to see if the locking impacted a real usecase. I couldn't find any difference: no matter the console driver or the application of this patch or not, the buildkernel was always the same.

Now the smaller change included: when input is for a terminal which is not currently displayed, we don't try to resume the vt_flush() anymore. This was a waste of time and resources.

In the end, I will commit those changes in several commits, not everything at once.

Test Plan
  • Run a benchmark like the example to see if you get any improvements or regressions.
  • Just use the console, including ddb, and see if something is broken.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/dev/vt/vt_buf.c
52 ↗(On Diff #42157)

nit: please rename this to ASSERT_VTBUF_IS_LOCKED so that its more clear it is just an assertion

53 ↗(On Diff #42157)

also, why bother defining this at all? Was this just a debugging artifact?

258–270 ↗(On Diff #42157)

Why do we use functions here and macros for other things like vtbuf_is_locked? Can we not just use the macro directly?

I have a mild preference for more functions and fewer macros (i.e., inline the macro into this function) as well.

Replace tc_prepare() by two callbacks dedicated to locking

Moving the TERMINAL_LOCK outside the loop in termtty_outwakeup() apaprently broke the locking model. When starting nvim (editors/neovim) on window 0, the laptop would reboot without further notice, not even a panic. This was ok on other windows because that lock is special in the case of the window used as the console. My guess is that holding TERMINAL_LOCK() while calling ttydisc_getc() was bad.

I couldn't find the real explanation behind this, but eventually, I changed the solution by adding two new callbacks (tf_video_lock() and tf_video_unlock()) instead of introducing the tc_prepare() callback and abusing tc_done(). That's what syscons(4) does in fact and the performance benefit is the same in the end.

The fix to the VGA palette was committed, thus it's gone from this review.

@eadler: I believe I adressed all your concerns.

I updated the review's description to reflect the new state. Also, my benchmark was wrong initially: I was comparing unmodified vt(4) in GENERIC and modified vt(4)/syscons(4) with GENERIC-NODEBUG. The numbers in the description are now based on GENERIC-NODEBUG. So an unmodified vt(4) takes 1500 ms to display the test file, while the patched one takes 760 ms.

This revision was not accepted when it landed; it landed in state Needs Review.May 16 2018, 9:01 AM
This revision was automatically updated to reflect the committed changes.