Page MenuHomeFreeBSD

New cmb(3) library and cmb(1) utility
Needs ReviewPublic

Authored by dteske on Jul 5 2018, 3:14 AM.

Details

Reviewers
emaste
adrian
eadler
0mp
bcr
imp
Group Reviewers
manpages
Summary

Import combinatorics library/utility

Here is an HTML formatted preview of the library manual:
https://fraubsd.org/doc/cmb.3.html

And utility manual:
https://fraubsd.org/doc/cmb.1.html

Over 30 years in the making, this is not just a cute tool to generate combinations. It is both a novel algorithm and the fastest implementation we have for complex combinatorics. It is faster than python's itertools, faster than numpy, faster than Perl, faster than Perl XS, faster than comb in R, faster than every library I have found and tested over the past 5 years.

Arch Linux has already ported an earlier version of this work to their Operating System.

Equipping the FreeBSD base with this utility/library allows us to take on new exciting possibilities such as generating a full-and-complete build-option-survey. However, one of the more exciting propositions is the idea that we could generate multiple binary packages from a single port that has multiple compile-time options. That way, we could support allowing people to download packages that were compiled with the options they desire instead of punting to require a src-compile of a port every time non-standard options are desired.

I have baked into the cmb suite a lot of functionality such as the ability to resume where you left-off if you prematurely terminate a run and other mathematical solutions. So taking on large tasks such as "compile the base OS with every possible combination of build-time options" is a possible endeavor.

While I do have additional features planned, such as dynamic shared objects loaded at runtime for defining custom behavior, I have decided that 30 years of development is enough and that the initial offering is more than complete -- further features can be added later since the core underpinnings are currently solid.

Test Plan

Use it to generate combinations.

See the examples in the utility manual:
https://fraubsd.org/doc/cmb.1.html

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 27882
Build 26053: arc lint + arc unit

Event Timeline

Ping. Reasons for having this in base are similar to basic reasoning for jot, seq, and other argument processing/generating utilities. cmb fills a gap in UNIX that should have been filled decades ago but nobody gave the effort (I suspect because it is a very difficult problem to solve; and given the complexity required, I can absolutely see why nobody has ever BSD-licensed such a tool).

Update patch for 13.0, bug fix, and libname edits

Make integer overflow with -t print -1 under MK_OPENSSL=no

Remove WARNS=6 (this is the default)

adridg added inline comments.
lib/libcmb/cmb.c
2

The year range looks unlikely (but possible .. it doesn't match cmb.h in any case)

lib/libcmb/cmb.h
39

If this is an installed header for consumption by other source in the base tree, won't this clash with other HAVE_CONFIG_H definitions? (infiniband/opensm seems to have one).

63

These are (static, internal) variables for cmb.c, right? They should live there, not in the header.

78

It looks like you want *either* the BN approach, or the uint64_t approach (e.g. in cmb.c only one of the two members is initialized). If you stick count and start in an #else you can avoid uninitialized fields in this struct.

0mp requested changes to this revision.Oct 30 2018, 1:47 PM
0mp added a subscriber: 0mp.
  1. I've heard that All rights reserved. is no longer needed.
  2. It would be nice if you could add an example or two to the manual page.

Also, is there a pattern for the use of Qq, Dq and Ql (instead of Dq Li)?

lib/libcmb/cmb.3
25

Missing $FreeBSD$.

This revision now requires changes to proceed.Oct 30 2018, 1:47 PM
dteske added inline comments.
lib/libcmb/cmb.c
2

I am seeing that libcmb/cmb.c and libcmb/cmb.h match:

From libcmb/cmb.c:

* Copyright (c) 2002-2018 Devin Teske <dteske@FreeBSD.org>

From libcmb/cmb.h:

* Copyright (c) 2002-2018 Devin Teske <dteske@FreeBSD.org>

Which is important because the first version of the code started its life as a utility for compiling all possible binary combinations of a given software, released on April 4, 2002 [1].

The reason why the copyrights in libcmb/cmb.c and libcmb/cmb.h (which read 2002-2018) differ from that of the other copyrights is because the libcmb/cmb.[ch] code is directly derived from my longterm work [2] while code with a copyright of 2018 is new code not drawn from the historical body of work.

[1] http://synhxd.sourceforge.net/doc/changes.txt (Excerpt below)

Version 0.2.2 April 4, 2002
===========================
- changed output of `/version' from `hxd ver...' to `shxd ver...'
  (since both shxd and hxd are in the 0.2.x version system now)
- created a new mkbin script (read the agreement in the source of
  the file, or don't run it)(file is `utils/bashmkbin')

[2] bashmkbin de-obfuscated, optimized, and ported to bourne shell

https://pastebin.com/6BXx8kTq

dteske added inline comments.
lib/libcmb/cmb.h
39

Not an installed header but a header generated by the cmb-specific configure scripts [1] [2] for compiling on platforms other than FreeBSD.

libcmb and cmb are compatible with Mac OS X and Linux with the correct compile flags.

There is no conflict because HAVE_CONFIG_H is something you apply to localized CFLAGS when (and only when) you have a config.h file for compiling the software on some unknown (but POSIX compliant) platform. You wouldn't ever define HAVE_CONFIG_H globally and the reason you see it in infiniband/opensm is for the same reasons (compatibility with an autotools build system that is unused in FreeBSD base). If you grep globally for HAVE_CONFIG_H in the base, you'll find that it's largely just cross-platform compatible software that makes use of it (89.36% of which is in contrib/).

Here we are largely:

  • Keeping the code compatible with autotools build system
  • Minimizing diffs when it compares to cross-platform compatibility
  • Leaving the configure and *.in files out of the FreeBSD tree

Another option would be to move libcmb and cmb into contrib including the autotools files (configure, configure.in, GNUmakefile.in, and config.h.in). However, I figured it would be more FreeBSD-like to drop the autotools despite the fact that it too works on FreeBSD if, say, for some reason you wanted to use gmake outside of the normal build process.

[1] https://github.com/FrauBSD/pkgcenter/blob/master/depend/libcmb/configure.in

[2] https://github.com/FrauBSD/pkgcenter/blob/master/depend/cmb/configure.in

dteske added inline comments.
lib/libcmb/cmb.h
78

Providing both allows programs linking to the library to choose between 64-bit and bn(3)

In D16132#379579, @0mp wrote:

Also, is there a pattern for the use of Qq, Dq and Ql (instead of Dq Li)?

Generally speaking you would combine "Li" with one of Ql, Dq, or Qq when you want the text to appear as-is within the selected quotation style (Qq and Dq producing double-quotes while Ql produces single).

We are essentially telling the parser not to potentially interpret things like the dash. While it appears as though "xargs -0" is treated the same between "Dq" and "Dq Li", the "Li" adds a layer of safety to make sure that it appears as written (should somebody want to copy/paste the partial command appearing between the quotes).

Address feedback and minor improvements:
+ Move static variables out of header (adridg)
+ Remove "All rights reserved" from copyrights (0mp)
+ Add EXAMPLES section to cmb(1) manual (0mp)
+ Add missing $FreeBSD$ (0mp)
+ Update cmb(1,3) HISTORY section [s/12.0/13.0/]
+ Improved error checking for 64-bit integer overflow
+ Fix copy-paste error in cmb(3) manual (/#endif/d)
+ Use errno(2) instead of errx(3) in cmb(3) library
+ Sync with upstream version 0.7 of [lib]cmb
+ Fix style(9) issue of single-line if-statements
+ Add missing $FrauBSD$ keyword to cmb(1) manual
+ Remove redundant RCS keywords from cmb(1)

cmb(1): Add missing -n option to usage and give a more informative usage
cmb(3): Fix style issue of single-line for-loops
cmb(3): API changes to cmb_print(1) and struct cmb_config

API Change:

Change the number of arguments that are passed to the action function
(default being cmb_print()). Introduce new first argument (changing
the total number of arguments from 2 to 3), struct cmb_config *config.

This allows me to eliminate the static globals in cmb.c to prevent
race-conditions in multi-threaded applications. It also cleans up the
processing of the config struct while improving accuracy of the manual
(which talks of cmb_print()'s direct use of config, not globals which
were only updated by a call to either cmb() or cmb_bn()).

Last, but not least, this opens the ability to use cmb_print() without
first calling cmb() or cmb_bn().

It may be waaay too basic for you, but Dennis Stanton & Dennis White, "Constructive Combinatorics", Springer Verlag, 1986, might be interesting. ISBN 0-387-96347-2. It discusses many ways of generating basic combinatorial objects (e.g. permutations, subsets, etc).

Comments in general not tied to a specific line of code:

  • The -c and -i flags ask for a deterministic order of the combinations. If I use -c 4 -i 1 I expect 4 combinations (whatever they may be). If I use -c 3 -i 2 I expect the last three of those four combinations. So the ordering must be consistent. It is, in your implementation, but the documentation doesn't guarantee that. The ordering produced is lexicographic, considering the order the item... are given on the command-line. The documentation for -i could say something like Skip the first num-1 combinations that are generated .. "position" isn't really well-defined. The example in the manpage does clear this up.
  • -k can be described as ther size of the combinations produced, and it also accepts just a plain integer, so -k 2 is equivalent to -k 2 2.
  • -k also accepts reversed min and max and then "counts down", in size of the combinations printed (e.g. 2..3 lists first the size-2 combinations, then the size-3 combinations, while 3..2 lists the size-3 first).
  • The description for -n doesn't say what happens if num is bigger than the number of item... on the command line; I'd describe it as Use the first num items given.
lib/libcmb/cmb.c
362

When iterating downwards, eg. using command-line options -k 3..2 this over-estimates the size remaining, so it will print garbage at the end (e.g. cmb -k 3..2 $( seq 1 4 ) prints bits of my environment and cmd -k 3..2 $( seq 1 3 ) bails out with Result too large). The symmetry with the nextset > 0 is broken which looks funny.

z=(z*--k)/++i fixes the issue.

dteske added inline comments.
lib/libcmb/cmb.c
362

Good catch! Fixed in the next update

Comments in general not tied to a specific line of code:

  • The -c and -i flags ask for a deterministic order of the combinations. If I use -c 4 -i 1 I expect 4 combinations (whatever they may be). If I use -c 3 -i 2 I expect the last three of those four combinations. So the ordering must be consistent. It is, in your implementation, but the documentation doesn't guarantee that.

You're right, I should update the documentation. This is usually well understood by those that study n-choose-k in combinatorics. I have long since forgotten what average people believe of combinations.

The ordering produced is lexicographic, considering the order the item... are given on the command-line. The documentation for -i could say something like Skip the first num-1 combinations that are generated .. "position" isn't really well-defined. The example in the manpage does clear this up.

I'll make it more like the examples section that I changed in last update, which is in-line with that "Skip ..." wording suggested.

  • -k can be described as ther size of the combinations produced, and it also accepts just a plain integer, so -k 2 is equivalent to -k 2 2.

I like your suggestion to use size.

  • -k also accepts reversed min and max and then "counts down", in size of the combinations printed (e.g. 2..3 lists first the size-2 combinations, then the size-3 combinations, while 3..2 lists the size-3 first).

Also, I support negative numbers too. A size of -1 will produce the combinations backward. A range of -1..2 will produce combinations backward but stop at size-2; meanwhile -2 is a pseudonym for -1..2 and has the same effect.

  • The description for -n doesn't say what happens if num is bigger than the number of item... on the command line; I'd describe it as Use the first num items given.

Thanks, will do by next update.

Update to address feedback from adridg

Bugs already fixed for next update:
+ -i NUM was broken with -k -1

Other changes in next update:
+ A new -e flag has been added to calculate the empty-set

+ cmb(1,3): Fix broken handling of -i num when given -k -1
+ cmb(3): Remove a stray newline
+ cmb(3): Add show_empty option for including the empty-set
+ cmb(1): Add -e option for showing the empty-set

Add support for negative `-i num' values
(for seeking to the back-end of a series)

Add more examples and update existing examples in the manual.

Add support for `-i random' for seeking to a random position.
Updated manual with new examples.

bcr added a subscriber: bcr.

OK from manpages.

Thanks! I have a huge update coming this month that will bring in lots of changes. Hoping you could revisit just afterward.

Overhaul [lib]cmb offerings to address previous feedback

This update also improves the initial offering by adding many
more features previously not thought-of as well as fixing
several bugs that were discovered with additional testing.

Tests were also added for libcmb. Documentation also improved
for both cmb(3) and cmb(1).

dteske edited the summary of this revision. (Show Details)
dteske edited the summary of this revision. (Show Details)
dteske edited the test plan for this revision. (Show Details)

In pkg we have a SAT solver. Can you explain how your CMB compares to a SAT solver?

In pkg we have a SAT solver. Can you explain how your CMB compares to a SAT solver?

"SAT solvers are algorithms for discovering satisfactory Boolean assignments – or proving that none exist. Such solvers are of growing practical importance, in part because a host of significant combinatorial problems can be reduced to SAT problems" [1]

Not all combinatorial problems have been reduced to SAT problems yet all SAT problems are in the realm of combinatorics.

CMB provides a generalized and abstract substrate to which one can build existing and not-yet-imagined solvers upon.
CMB is a powerful library interface and command-line utility that facilitates tackling a near-unlimited number of combinatorial problems.

[1] https://sinews.siam.org/Details-Page/donald-knuth-talks-satisfiability-and-combinatorics

Returning to this, as I still want to add it to base.

The salient point that I would like to make here, with respect to SAT solvers, is that:

CMB is not a solver

It was not written to solve equations. It was written to give scope on combinatoric problems and enumerate combinations for actions.

For example, the build-option survey is not trying to solve an equation but test for combinations of build-options that break the build when combined together.

In example, if one has 32,000 build-options, cmb can first tell you how many combinations pairs-of-2 would result in, allowing you to determine whether it is worth your time.

cmb -t -r -k 2 32000

The result of the above command tells me that there are 511,984,000 combinations for "choose-2 from 32,000 items."

So right there, cmb was able to answer a question of whether it's even worth doing a build-option survey of all-known options (which I am sure there are less than 32,000 build options) choosing only combinations-of-2.

I am not trying to solve an equation but get a handle on the scope of the issue which is wanting to know which pairings of build options are broken.

Secondly, once you land on a set of combinations that you want to iterate over, cmb will help you iterate over them (and also resume where you left-off if you prematurely terminate and also help you parallelize the process or even distribute it amongst multiple systems).

For example:

cmb -e -N -k 1..2 -- -{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p}

Produces:

1 
2 -a
3 -b
4 -c
5 -d
6 -e
7 -f
8 -g
9 -h
10 -i
11 -j
12 -k
13 -l
14 -m
15 -n
16 -o
17 -p
18 -a -b
19 -a -c
20 -a -d
21 -a -e
22 -a -f
23 -a -g
24 -a -h
25 -a -i
26 -a -j
27 -a -k
28 -a -l
29 -a -m
30 -a -n
31 -a -o
32 -a -p
33 -b -c
34 -b -d
35 -b -e
36 -b -f
37 -b -g
38 -b -h
39 -b -i
40 -b -j
41 -b -k
42 -b -l
43 -b -m
44 -b -n
45 -b -o
46 -b -p
47 -c -d
48 -c -e
49 -c -f
50 -c -g
51 -c -h
52 -c -i
53 -c -j
54 -c -k
55 -c -l
56 -c -m
57 -c -n
58 -c -o
59 -c -p
60 -d -e
61 -d -f
62 -d -g
63 -d -h
64 -d -i
65 -d -j
66 -d -k
67 -d -l
68 -d -m
69 -d -n
70 -d -o
71 -d -p
72 -e -f
73 -e -g
74 -e -h
75 -e -i
76 -e -j
77 -e -k
78 -e -l
79 -e -m
80 -e -n
81 -e -o
82 -e -p
83 -f -g
84 -f -h
85 -f -i
86 -f -j
87 -f -k
88 -f -l
89 -f -m
90 -f -n
91 -f -o
92 -f -p
93 -g -h
94 -g -i
95 -g -j
96 -g -k
97 -g -l
98 -g -m
99 -g -n
100 -g -o
101 -g -p
102 -h -i
103 -h -j
104 -h -k
105 -h -l
106 -h -m
107 -h -n
108 -h -o
109 -h -p
110 -i -j
111 -i -k
112 -i -l
113 -i -m
114 -i -n
115 -i -o
116 -i -p
117 -j -k
118 -j -l
119 -j -m
120 -j -n
121 -j -o
122 -j -p
123 -k -l
124 -k -m
125 -k -n
126 -k -o
127 -k -p
128 -l -m
129 -l -n
130 -l -o
131 -l -p
132 -m -n
133 -m -o
134 -m -p
135 -n -o
136 -n -p
137 -o -p

You can process the list sequentially or in-parallel or in distribution fashion because each combination has a deterministic index that is reproducible. Combination number 120 is "-j -n" for example. If you were processing this list of potential flag combinations (for whatever -- compiling, running, whatever) sequentially and you bailed on combination 120, you can resume from that point (example below):

cmb -e -N -k 1..2 -i 120 -- -{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p}

Produces:

120 -j -n
121 -j -o
122 -j -p
123 -k -l
124 -k -m
125 -k -n
126 -k -o
127 -k -p
128 -l -m
129 -l -n
130 -l -o
131 -l -p
132 -m -n
133 -m -o
134 -m -p
135 -n -o
136 -n -p
137 -o -p

Or if you just needed that single combination (for example, if you were processing combinations in-parallel or in distributed fashion):

cmb -e -k 1..2 -i 120 -c 1 -- -{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p}

Produces combination number 120:

-j -n

So while it is perhaps convenient to think that CMB is using combinatorics to solve equations and therefore can be relegated to being an insignificant offering (compared to SAT solvers), attempting to use a SAT solver to the above is a non-starter because SAT solvers are always trying to find short-cuts but for tasks like those above, there is no shortcut. In the above-case, you cannot know whether a combination of compile-time options are incompatible with each other unless you actually try and build with them.

When attempting to solve problems like that above, no tool exists in base to facilitate such testing. Not seq, nor jot, nor any other base command-line utility in FreeBSD, Linux, or any UNIX for that matter attempts to solve this problem.

It is in this light that I offer a solution that is easy to use, efficient, and powerful. Just like seq, jot, and other utilities which are not designed to solve a specific problem but provide the ability to perform required actions (specifically enumerative actions).

I'm going to set an ETIMEOUT of 30 days. If there is no objection, I will import to base HEAD.

Re-iterating the reasons why this should bypass ports and go to base:

This tool is being developed to ultimately aid base and ports functionality. If it exists in ports, it becomes like dialog4ports and has to be installed before you can extract all the functionality of ports.

I have ports that I want to use this in the Makefile and tools for base that I need to use this in (the build option survey, for example). Putting this in ports complicates those endeavors.

I'm going to set an ETIMEOUT of 30 days. If there is no objection, I will import to base HEAD.

Re-iterating the reasons why this should bypass ports and go to base:

This tool is being developed to ultimately aid base and ports functionality. If it exists in ports, it becomes like dialog4ports and has to be installed before you can extract all the functionality of ports.

I have ports that I want to use this in the Makefile and tools for base that I need to use this in (the build option survey, for example). Putting this in ports complicates those endeavors.

Hi Devin,

Could you elaborate more on your planned uses of this code in the base system? The build option survey looks to me like something that would live in tools/, in which case it is reasonable to depend on a port. We keep python scripts there, for example. The mailing list thread about cmb(1) and libcmb seemed to conclude with the suggestion that they be added as a port, but I can't seem to find one.

Hi markj,

Yes, it was suggested that I make a port first.

However, everybody seems to be ignoring what I am saying regarding the utility of this tool being equitable to seq, jot, and friends.

Just yesterday, I did the following:

cmb -d / {a..z} | xargs stat 2>&1 | dpv -l -

The purpose was to generate millions of stat errors so that I could work on some filesystem patches to improve the performance of stat in nfsd.

Could I have used jot? Maybe, but it would have been more difficult.
Could I have used seq? Maybe, but same problems as jot. I needed discretely unique test matter.

I have been ignoring the suggestion to put it into ports because not one single person has even addressed the use-cases for base.

How would one make an argument for jot or seq being in base? If you pretend that they are not in base, what arguments would you use to introduce them?

cmb is kind of like that -- it has a million uses. Asking why we should have it in base is kind of like asking why we should have bc in base.

cmb is a math tool. A very powerful tool that fills a myriad requirements, not to mention the following use-case which I will once-again reiterate that has once again been ignored ...

I don't care about the build option survey. It was an example.

I care about using this for combinatoric combination of ports given various options.

Let's talk about dialog4ports -- a tool that lives in ports that is required by ports. So many times this has caused me headaches and I really don't want to go down that road. It is very frustrating when you get into a situation where ports needs X from ports but you can't compile it because your base and ports frameworks have diverged so you then have to devolve into first updating your base shared Mk files relied-on by ports. It's real shit-show.

Putting cmb in base will allow it to:
a. solve math problems
b. allow me to work on base enhancements with only base (think Filesystem debugging, filesystem optimizations, memory testing, scheduler profiling, etc.)
c. be usable by ports without being stuck in the quagmire-catch-22 described above

Hi markj,

Yes, it was suggested that I make a port first.

However, everybody seems to be ignoring what I am saying regarding the utility of this tool being equitable to seq, jot, and friends.

Just yesterday, I did the following:

cmb -d / {a..z} | xargs stat 2>&1 | dpv -l -

The purpose was to generate millions of stat errors so that I could work on some filesystem patches to improve the performance of stat in nfsd.

Could I have used jot? Maybe, but it would have been more difficult.
Could I have used seq? Maybe, but same problems as jot. I needed discretely unique test matter.

I have been ignoring the suggestion to put it into ports because not one single person has even addressed the use-cases for base.

How would one make an argument for jot or seq being in base? If you pretend that they are not in base, what arguments would you use to introduce them?

cmb is kind of like that -- it has a million uses. Asking why we should have it in base is kind of like asking why we should have bc in base.

cmb is a math tool. A very powerful tool that fills a myriad requirements, not to mention the following use-case which I will once-again reiterate that has once again been ignored ...

I don't care about the build option survey. It was an example.

I care about using this for combinatoric combination of ports given various options.

Let's talk about dialog4ports -- a tool that lives in ports that is required by ports. So many times this has caused me headaches and I really don't want to go down that road. It is very frustrating when you get into a situation where ports needs X from ports but you can't compile it because your base and ports frameworks have diverged so you then have to devolve into first updating your base shared Mk files relied-on by ports. It's real shit-show.

Putting cmb in base will allow it to:
a. solve math problems
b. allow me to work on base enhancements with only base (think Filesystem debugging, filesystem optimizations, memory testing, scheduler profiling, etc.)
c. be usable by ports without being stuck in the quagmire-catch-22 described above

Regarding the prior discussion, I just see that the original thread on -announce appears to end with you saying that you would make a port. That didn't happen, and since there was some objection to putting cmb and libcmb straight into the base system, we should make sure that those objections won't be raised again after a commit. The right way to do that is to follow up on the lists, like -hackers and -arch, since only a small handful of developers are subscribed to this review.

I understand that cmb is intended to be a general-purpose utility, but that alone is not sufficient for putting it in the base system. jot(1) and seq(1) make for an interesting comparison since seq's functionality is a subset of jot's; seq was added to the base system relatively recently, specifically to make us more compatible with other unix environments. jot predates FreeBSD and comes from a time where a batteries-included approach to shipping an OS was more important than it is today.

I personally think that cmb would be an interesting addition to the base system, but there should be some concrete justification for it to be there. The build option survey is not really a compelling example. A ports target that builds a port with all possible combinations of options seems useful, but is there currently some work in progress to implement that? I understand that having the ports framework depend on a port is painful, but unless the absence of cmb from the base system is currently blocking a useful project, I don't see why this is a strong argument for not making a cmb port. You've written a number of large sh-based components of the base system - could any of them make use of cmb? My point is just that I believe that you could successfully argue for cmb's inclusion in the base system, but the idea first needs to be socialized more fully. A port is a reasonable alternative in the meantime and will make it easy for others to try using cmb.

LGTM for the man page related things.