
Make eventratecheck (formerly ppsratecheck) safe on multi-processor systems
Needs Review · Public

Authored by jtl on Mar 2 2023, 4:40 PM.
Tags
None

Details

Reviewers
glebius
markj
imp
Summary

eventratecheck() (formerly ppsratecheck()) is used in many places in the kernel. It is currently not multi-processor safe, which means it does not always correctly enforce the rate due to races.

Additionally, under extremely high load (greater than INT_MAX events per second), it is possible for the event counter to wrap and erroneously allow all events. While not terribly likely, this becomes increasingly possible as performance levels rise.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

jtl requested review of this revision. Mar 2 2023, 4:40 PM
jtl created this revision.

Atomics probably already make it too expensive to use.

If you want to make the code mpsafe, you should make it per-CPU, with the understanding that the accuracy of rate checking will be limited. But that's significant churn and I don't know if it's worth it right now.

In D38853#884557, @mjg wrote:

Atomics probably already make it too expensive to use.

Do you have evidence to back this up? An atomic instruction may be more expensive than an unlocked instruction, but what makes it "too expensive to use" in this context?

Quite frankly, I'm confused as to how people keep claiming atomic ops are cheap, when it seems to be common knowledge that they are very expensive, especially on amd64. Maybe I should do a writeup with real numbers(tm). That aside,

If you have enough traffic hitting the routine for the SMP issue to be relevant, then the cache-line ping-pong already suffered by the routine will be amplified by the atomic ops used to make sure the updates are lossless. Even if you have only one CPU doing any work here, the execution time of the routine will easily go up by at least 50%. This reason alone makes it a bad choice.

I agree the unintentionally lossy nature of the current code is a bug; I just don't think the supposed rate limit is really expected to be very strict. A variant with some leniency (if you will), where limiting is per-CPU only, will probably be good enough to avoid flooding anything. If not, it can be patched to resort to synchronized operation after some number of hits.

In D38853#884786, @mjg wrote:

Quite frankly, I'm confused as to how people keep claiming atomic ops are cheap, when it seems to be common knowledge that they are very expensive, especially on amd64. Maybe I should do a writeup with real numbers(tm).

I made no such claim. In fact, I acknowledged that atomic instructions may be more expensive than unlocked operations. What I challenged was the claim that they are "too expensive" for this particular use case.

That aside,

If you have enough traffic hitting the routine for the SMP issue to be relevant, then the cache-line ping-pong already suffered by the routine will be amplified by the atomic ops used to make sure the updates are lossless. Even if you have only one CPU doing any work here, the execution time of the routine will easily go up by at least 50%. This reason alone makes it a bad choice.

At the micro level, the cost of this function will probably rise (with the exact cost depending on the system architecture, the number of NUMA domains, the number of processors, how often the operations are contended, etc.). But, even with that rise, I suspect (although I have no data to prove it) that incrementing the atomic counter should be no more expensive than acquiring an uncontested spin lock which is cacheline-aliased with some highly-updated piece of memory. So, I think it is worth putting the expense in perspective. (But, you are welcome to disagree if you think that I am understating the expense.)

However, at the macro level, there is a real cost to inaccuracy, as well. We use this to rate limit doing additional work, and that additional work has cost. With the current racy counter increment, we can lose events (thread 1 loads the counter, thread 2 loads the counter, thread 1 locally increments the counter, thread 2 locally increments the counter, thread 1 stores the counter, thread 2 stores the counter: end result is counter += 1, rather than counter += 2). Doing extra work just a few times because our rate limit allowed additional events may well obliterate all the savings to be had from maintaining the lossy counter in the first place.
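
To make the lost-update race concrete, here is a minimal standalone sketch. It is an illustration only: C11 atomics and pthreads stand in for the kernel's atomic(9) primitives, and the counters, thread count, and iteration count are invented for the example. Several threads increment a plain int and an atomic int the same number of times; the plain counter routinely comes up short.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS	4
#define NITERS		1000000

static int plain_counter;		/* racy load/increment/store */
static _Atomic int atomic_counter;	/* lossless read-modify-write */

static void *
worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < NITERS; i++) {
		/* Two threads can load the same value and each store value + 1. */
		plain_counter++;
		/* In the kernel this would be atomic_fetchadd_int(&counter, 1). */
		atomic_fetch_add_explicit(&atomic_counter, 1,
		    memory_order_relaxed);
	}
	return (NULL);
}

int
main(void)
{
	pthread_t threads[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);

	/* The plain counter typically falls short of NTHREADS * NITERS. */
	printf("plain: %d  atomic: %d  expected: %d\n",
	    plain_counter, atomic_counter, NTHREADS * NITERS);
	return (0);
}

In the rate-check context, the lost updates mean the shared count understates how many events were actually allowed, so more events slip past the configured limit.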

I agree the unintentionally lossy nature of the current code is a bug; I just don't think the supposed rate limit is really expected to be very strict. A variant with some leniency (if you will), where limiting is per-CPU only, will probably be good enough to avoid flooding anything. If not, it can be patched to resort to synchronized operation after some number of hits.

Because this is an old and widely used KPI, I'm not sure it is feasible to update this function so it behaves very differently from the way it does today. Several years ago, we added counter_ratecheck(9) to provide a new mechanism for doing rate-limiting without atomics. What you are suggesting appears to be a variation on that KPI. I am very supportive of improving that KPI and/or increasing conversion of performance-critical paths to it.

What I am suggesting is that we fix the widely-used eventratecheck() KPI in the meantime so it functions correctly even if that incurs additional overhead, and then work to move things in performance-critical paths to something more performant, as needed.

I suppose an alternative path forward is to codify the current behavior by putting warnings on the functions that they are lossy and inexact. I personally think we should fix the function, but I'm open to arguments to the contrary.

Again, you are welcome to disagree with me if you think I'm missing important points or not considering important factors.

sys/kern/kern_time.c
1146

Should this be atomic_store_rel_int()?

Where is the corresponding acquire barrier?

1156

For this to prevent an atomic operation, we would need INT_MAX events in a second, right? I think that's quite unlikely to happen.

Why can't we stop incrementing once maxeps is reached?

sys/kern/kern_time.c
1146

Yes, it absolutely should be! Hopefully, I would have caught that in my pre-commit testing. But, you are absolutely correct and I'll fix that.

With respect to the acquire barrier, it is nominally the atomic_cmpset_acq_time() on the previous line. However, I don't think we really need acquire/release semantics and I would be happy to drop them. Let me know if you have a strong opinion.

1156

The reason I did this is because I don't know if anyone relies on the current behavior. Right now, you could theoretically do something like this:

int limit1 = 10;
int limit2 = 20;
static int count;           /* event count shared by both cases */
static struct timeval tv;   /* timestamp shared by both cases */

switch (event) {
case EVENT1:
    if (eventratecheck(&tv, &count, limit1)) {
        /* Do something. */
    }
    break;
case EVENT2:
    if (eventratecheck(&tv, &count, limit2)) {
        /* Do something. */
    }
    break;
}

Now, I don't know why you would do that, but it is theoretically a possibility.

If we are willing to break any use case that relies on the current behavior of continuing to update the count even once the limit has been reached (but without changing the meaning of the arguments), we could instead switch to a countdown behavior:

if (maxeps <= 0)
        return (maxeps != 0);   /* Negative means unlimited; zero means never. */

/*
 * At most once per second, one thread wins the cmpxchg, resets the
 * remaining budget to maxeps - 1 (its own event included), and is allowed.
 */
if ((last == 0 || (u_int)(now - last) >= hz) &&
    atomic_cmpset_acq_time(&lasttime->tv_sec, last, now)) {
        atomic_store_rel_int(cureps, maxeps - 1);
        return (maxeps != 0);
}

/* The budget for this interval is exhausted; suppress the event. */
if (*cureps <= 0)
        return (0);

/* Consume one unit of budget; allow the event only if budget remained. */
count = atomic_fetchadd_int(cureps, -1);
return (count > 0);

This would have the downside of taking up to one full second to recognize changes in the limit. But, I would hope that is acceptable. I greatly prefer this second formulation, but wasn't sure we could make that change. If you think we can, I would be open to proposing this instead.

share/man/man9/ratecheck.9
92

or anything that helps contrast with the statement about ratecheck().

sys/kern/kern_time.c
1146

I think you're right that we don't need any special memory ordering here. But I don't have a strong opinion. :)

1156

I went through the callers of ppsratecheck() in the tree. The ones that share state among multiple calls to ppsratecheck() all use the same limit, which I believe means that your proposed change won't break them. It seems like a reasonable constraint on the interface.

So the more I look at it the more I'm convinced the current API is just a holdover from old times(tm) and needs to be reworked.

The vast majority of consumers use a rate of 1, as in once per second. And even then this is for some printf-ing purposes, so it does not have to be strictly accurate.

In order to support that accurately and cheaply enough, one only needs to store 'time' and cmpxchg on it, and only if it differs. Thus, no matter the flood, there would be atomic ops at most once per second. This is still not ideal if there *is* a flood, but it beats rolling with fetchadd every time.

Also note that maxeps <= 0 should be considered a programming error.

I found some consumers which pass a non-1 rate. That's not good, but they can be handled separately.

The way I see it, the key for now is to provide an API which does not grab wtf arguments and fits the needs of actual consumers. To that end:

struct goodname_goes_here {
        int time;
};
bool goodname_goes_here_too(struct goodname_goes_here *);

where 'true' means do the thing, 'false' means don't
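
For illustration only, here is a minimal userspace sketch of how such a "fire at most once per second" helper could be built on the cmpxchg-on-the-timestamp idea described above. The struct and function names are the placeholders from this comment; the use of time(NULL) (standing in for the kernel's time_uptime) and C11 atomics (standing in for atomic(9)) are assumptions for the sketch, not the actual patch:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct goodname_goes_here {
	_Atomic int time;	/* last second in which we fired; 0 = never */
};

/*
 * Return true at most once per second.  The only write is a single
 * compare-and-swap, attempted only when the stored second differs from
 * the current one, so a flood of calls within one second performs no
 * atomic read-modify-write at all.
 */
static bool
goodname_goes_here_too(struct goodname_goes_here *g)
{
	int now, last;

	now = (int)time(NULL);		/* kernel code would use time_uptime */
	last = atomic_load_explicit(&g->time, memory_order_relaxed);
	if (last == now)
		return (false);		/* already fired this second */
	/* One winner per second; losers observe the updated timestamp. */
	return (atomic_compare_exchange_strong(&g->time, &last, now));
}

int
main(void)
{
	static struct goodname_goes_here limiter;

	for (int i = 0; i < 1000000; i++) {
		if (goodname_goes_here_too(&limiter))
			printf("event allowed at iteration %d\n", i);
	}
	return (0);
}

A non-1 rate could still be layered on top by resetting a per-interval budget whenever the timestamp cmpxchg succeeds, which is essentially what the countdown variant earlier in this review does.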

In D38853#888300, @mjg wrote:

So the more I look at it the more I'm convinced the current API is just a holdover from old times(tm) and needs to be reworked.

The vast majority of consumers use a rate of 1, as in once per second. And even then this is for some printf-ing purposes, so it does not have to be strictly accurate.

In order to support that accurately and cheaply enough, one only needs to store 'time' and cmpxchg on it, and only if it differs. Thus, no matter the flood, there would be atomic ops at most once per second. This is still not ideal if there *is* a flood, but it beats rolling with fetchadd every time.

That is what Jonathan's follow-up suggestion accomplishes, and it doesn't have the limitation of requiring a rate of 1/s.

In D38853#888300, @mjg wrote:

So the more I look at it the more I'm convinced the current API is just a holdover from old times(tm) and needs to be reworked.

The vast majority of consumers use a rate of 1, as in once per second. And even then this is for some printf-ing purposes, so it does not have to be strictly accurate.

In order to support that accurately and cheaply enough, one only needs to store 'time' and cmpxchg on it, and only if it differs. Thus, no matter the flood, there would be atomic ops at most once per second. This is still not ideal if there *is* a flood, but it beats rolling with fetchadd every time.

That is what Jonathan's follow-up suggestion accomplishes, and it doesn't have the limitation of requiring a rate of 1/s.

With my proposal there is *no* fetchadd for the 1-per-second rate. There is at most one cmpxchg per second per CPU, which is far from ideal but probably tolerable.

The patch still resorts to fetchadd to bump cureps if the cmpxchg failed, there was no overflow, and the CPU raced past the cureps store. This is not fixable in the current API, as different consumers can in principle pass different maxeps values, so you still need to track occurrences.

A dedicated API which does it "at most once per second (or so)" is clearly simpler and sufficient for almost all consumers. It can also avoid exposing implementation details to its consumers by wrapping the arguments in a struct.
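
To show what wrapping the arguments in a struct could mean for call sites, here is a hypothetical before/after comparison. Declarations only: the ppsratecheck() prototype matches the existing KPI, while the goodname_goes_here names remain placeholders and both call sites are invented for illustration.

#include <stdbool.h>
#include <stdio.h>
#include <sys/time.h>

/* Declarations only; real definitions live in the kernel / in the proposal. */
int ppsratecheck(struct timeval *lasttime, int *curpps, int maxpps);
struct goodname_goes_here { int time; };
bool goodname_goes_here_too(struct goodname_goes_here *);

/* Existing KPI: each call site carries two pieces of state plus the rate. */
static void
report_error_today(void)
{
	static struct timeval lasttime;
	static int curpps;

	if (ppsratecheck(&lasttime, &curpps, 1))
		printf("something went wrong\n");
}

/* Proposed KPI: state is opaque and the once-per-second policy is implied. */
static void
report_error_proposed(void)
{
	static struct goodname_goes_here limiter;

	if (goodname_goes_here_too(&limiter))
		printf("something went wrong\n");
}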