Page MenuHomeFreeBSD

PR208117: regex(3): implement equivalence classes
AcceptedPublic

Authored by yuripv on Dec 12 2018, 4:28 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Jan 20, 10:38 PM
Unknown Object (File)
Dec 17 2024, 7:34 AM
Unknown Object (File)
Nov 30 2024, 3:22 AM
Unknown Object (File)
Nov 26 2024, 11:09 AM
Unknown Object (File)
Nov 23 2024, 8:35 AM
Unknown Object (File)
Nov 21 2024, 1:27 PM
Unknown Object (File)
Nov 21 2024, 3:47 AM
Unknown Object (File)
Nov 10 2024, 4:07 PM
Subscribers

Details

Reviewers
bapt
kevans
pfg
kib
Summary

regex: implement equivalence classes.

We already had the meaty part in __collate_equiv_value() that came with locale re-work from DFBSD, but was unused. Modify it to our needs making it accept just the single wide character as argument.

With that in place, the change is pretty simple -- parse equivalence-class bracketed expression, put resulting primary collating value to new filed 'cequiv' in cset struct, don't let p_bracket() "optimize" the resulting singleton, and, finally, do the collating values comparison first one in CHIN().

Test Plan

See added test case.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

yuripv added inline comments.
lib/libc/locale/xlocale_private.h
90

This looks unrelated, though done to silence the warning in libregex as it looks better than silencing the compiler via -Wno-...:

cc  -O2 -pipe   -I/home/yuripv/ws/mbregex/contrib/netbsd-tests/lib/libc/regex -I/home/yuripv/ws/mbregex/lib/libc/regex -g -MD  -MF.depend.h_regex.debug.o -MTdebug.o -std=gnu99 -fstack-protector-strong    -Qunused-arguments  -c /home/yuripv/ws/mbregex/contrib/netbsd-tests/lib/libc/regex/debug.c -o debug.o
In file included from /home/yuripv/ws/mbregex/contrib/netbsd-tests/lib/libc/regex/debug.c:41:
In file included from /home/yuripv/ws/mbregex/lib/libc/regex/regex2.h:39:
In file included from /home/yuripv/ws/mbregex/lib/libc/regex/../locale/collate.h:44:
/home/yuripv/ws/mbregex/lib/libc/regex/../locale/xlocale_private.h:142:18: warning: passing 'long *' to parameter of type 'volatile u_long *'
      (aka 'volatile unsigned long *') converts between pointers to integer types with different sign [-Wpointer-sign]
        atomic_add_long(&(obj->retain_count), 1);
                        ^~~~~~~~~~~~~~~~~~~~
/usr/include/machine/atomic.h:451:1: note: passing argument to parameter 'p' here
ATOMIC_ASM(add,      long,  "addq %1,%0",  "er",  v);
^
/usr/include/machine/atomic.h:148:43: note: expanded from macro 'ATOMIC_ASM'
atomic_##NAME##_##TYPE(volatile u_##TYPE *p, u_##TYPE v)\
pfg added a reviewer: theraven.
This revision is now accepted and ready to land.Dec 12 2018, 4:54 PM
lib/libc/locale/xlocale_private.h
90

Reading the comment just above the struct definition (better late than never), I now don't think that this change is a good idea (or even correct).. It would be nice if @theraven could comment on this.

yuripv edited reviewers, added: kib; removed: theraven.

Drop the xlocale_private.h change, it's not really needed.

Provide private __collate_equiv_value symbol and add Konstantin as reviewer as I'm not really sure it's acceptable after reading https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232282#c4. If there are strong objections to doing this, I'll change this to only affect libc version, and libregex will behave as before doing strict character code comparison as I don't see any other way of providing collation equivalent value outside of libc.

This revision now requires review to proceed.Jan 1 2019, 7:40 AM

I do not have any additional useful comments. The only thing I could suggest is consider moving libregex from its own pkg for pkgbase to clibs, due to the dependency on the private symbol. I am not sure what is the use of libregex in the base system, i.e. how critical is to have it operational for successful upgrade.

FWIW, I have a patch from the late ache@ to finish the removal of __collate_range_cmp() started in r302512.
It is practically impossible to remove a symbol as that would in theory break the ABI, so I understand we may be wary of adding new ones.

Thinking about functionailty first though .... I wish we had some way to trade and it would be less difficult to add/remove private symbols. (Just wishful thinking, no solution).

@kevans any thoughts here? I'm still hesitant to add the symbol, so are you ok with the change only being in libc and libregex using the old behavior (strict character comparison instead of equivalence class)?

@kevans any thoughts here? I'm still hesitant to add the symbol, so are you ok with the change only being in libc and libregex using the old behavior (strict character comparison instead of equivalence class)?

I think I'm generally ok with it; my only concern is that my current plan is for bsdgrep to link against libregex by default after a battery of exp-runs for all of my libregex stuff is completed, so default bsdgrep won't have equivalence class support. I plan to set that up properly behind MK_GNU_GREP_COMPAT, though, so one can decide if that's a trade-off worth having (gnu extensions vs. equivalence classes). I'd say it's likely that this won't cause too much pain, because the GNU extensions aren't all that great and expressions can fairly easily be rewritten to not use them if equivalence classes are necessary.

I also plan to find a replacement implementation for libregex at the very least -- my eyeball's still on Onigmo if I can figure out what I thought was wrong with it, but I'll need to re-implement equivalence classes in it from the looks of it. I suspect at this point the plan will be to replace the libregex implementation with it, bring that implementation up to par, then swap out libc regex(3) once I've got Onigmo to feature parity.

This revision is now accepted and ready to land.Feb 17 2019, 11:11 AM