Page MenuHomeFreeBSD

syslogd: Support locale when removing unsafe characters
Needs ReviewPublic

Authored by on Sep 16 2020, 5:36 PM.



From :

syslogd has code to convert all logged messages to "safe" strings.

At the moment, the code converts control characters to "^x" sequences and ALL 8-bit characters to "M-x" sequences. This means that printable characters in character sets other than ASCII are converted and so do not display as expected when viewing the logs.

This patch adds LC_CTYPE locale support to syslogd and changes the "safe" conversion code to examine the logged characters using mbrtoc32() and to use iswgraph() to test if a character needs converting to safe sequences.

It also uses vis() to do the conversion which is similar to OpenBSD but which means control chars become \^x and non-graphical 8-bit chars become \M-x.

Test Plan

simplest way is this:

  1. Add to /etc/rc.conf: syslogd_env="LC_CTYPE=C.UTF-8"
  1. Restart syslogd.
  1. Run: echo '\xe0\xb8\xaa\xe0\xb8\xa7\xe0\xb8\xb1\xe0\xb8\xaa\xe0\xb8\x94\xe0\xb8\xb5' | logger
  1. Look at /var/log/messages. You should see "สวัสดี" (hello in Thai).

Diff Detail

Lint Skipped
Unit Tests Skipped

Event Timeline

The only change I made to the original patch was to re-order some of the #include directives, as more had been added after the original patch was submitted.


mask_C1 is now a write-only variable. What does vis() do with C1 control characters?


Is the cast to size_t actually required?

Don't we need to limit that parameter to min(MB_LEN_MAX, <# bytes left>)?


When can mbrtoc32() return a value greater than n? It's not clear to me from reading the man page that that's permitted.


This might as well be a memcpy(). strncpy() doesn't guarantee nul-termination here, the assignment after the loop does.

I tested it more thoroughly and found an additional problem (on line 984, it used to test if a size_t was less than 0), along with the ones markj pointed out. They should all be fixed now.


From my testing, vis() automatically converts C1 characters. I added back the old behaviour for when mask_C1 is false.


end_q - q measures the number of bytes remaining in the output buffer, but we're passing the input buffer to mbrtoc32(). In particular, mbrtoc32() assumes that it can access up to n bytes of the input buffer, which is not always true.

max_len should be of type size_t.

Fixed another issue: mbrtoc32() returns the error value (size_t)-2 if it reads a C1 control character. This wouldn't be a problem by itself, as the rest of the function takes care of it, but this screws up mbrtoc32()'s internal state for all of the characters afterwards. Switching to the non-restartable mbtowc() resolves that issue. However, I am not sure if this has any bad implications.


It's better to write end_p = p + strlen(p) and check end_p - p rather than recomputing the entire string's length for each wide character.


Just setlocale(LC_CTYPE, "") probably.

Only three locales are defined by default, the empty string "" which
denotes the native environment, and the "C" and "POSIX" locales, which
denote the C language environment.  A locale argument of NULL causes
setlocale() to return the current locale.

If LC_CTYPE is unset getenv returns null, which causes setlocale() to just return the current setting. The empty string sets it according to environment variables automatically.

Seems reasonable to me.


You could rewrite this as while (p < end_p && q < end_q) I think.

This revision is now accepted and ready to land.Sep 22 2020, 1:39 PM
This revision now requires review to proceed.Sep 22 2020, 6:48 PM

(I plan to commit this, but I would like to finish some work on sandboxing syslogd first - all of this string parsing code makes me a bit nervous.)