
Fix early epoch calls
ClosedPublic

Authored by jtl on Oct 10 2018, 4:32 AM.

Details

Summary

The epoch(9) subsystem is supposed to guarantee that deferred functions scheduled with epoch_call(9) will not run until all threads currently in a corresponding epoch section have exited that section. Likewise, epoch_wait(9) is not supposed to return until all threads currently in a corresponding epoch section have exited that section. To enforce these guarantees, the epoch subsystem tracks both an increasing global epoch value and also a per-CPU epoch value which indicates the global epoch in which the "oldest" active thread first entered the epoch section. (Here, "oldest" simply indicates how long the thread has been in the epoch section.)
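
To make that bookkeeping concrete, below is a minimal, self-contained C model of the scheme just described. The model_* names are invented purely for illustration; this is not the epoch(9) or ck_epoch implementation, only a sketch of the invariant it is meant to maintain.

  /*
   * Hypothetical model of the bookkeeping described above; for
   * illustration only, not the real epoch(9)/ck_epoch code.
   */
  #include <stdbool.h>

  #define MODEL_NCPUS   4

  struct model_record {
          unsigned long epoch;    /* global epoch seen by the oldest active reader */
          int           active;   /* nesting count of readers on this CPU */
  };

  /* Advanced by the write side; the advancement itself is not modeled here. */
  static unsigned long global_epoch = 1;
  static struct model_record records[MODEL_NCPUS];

  /* Reader entry: record which global epoch the oldest reader started in. */
  static void
  model_read_enter(int cpu)
  {
          struct model_record *rec = &records[cpu];

          if (rec->active++ == 0)
                  rec->epoch = global_epoch;
  }

  static void
  model_read_exit(int cpu)
  {
          records[cpu].active--;
  }

  /*
   * A deferred call queued when the global epoch was queued_epoch may run
   * only once no CPU has an active reader whose recorded entry epoch is
   * <= queued_epoch: such a reader might have entered before the call was
   * queued and could still reference the old data.
   */
  static bool
  model_may_run_deferred(unsigned long queued_epoch)
  {
          for (int cpu = 0; cpu < MODEL_NCPUS; cpu++) {
                  struct model_record *rec = &records[cpu];

                  if (rec->active != 0 && rec->epoch <= queued_epoch)
                          return (false);
          }
          return (true);
  }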

Due to a logic error in ck_epoch_poll_deferred(), these guarantees can be violated. When ck_epoch_scan() indicates that a thread is still in a previous epoch section, ck_epoch_poll_deferred() resets the epoch value for the current CPU to the global epoch value. This causes it to appear that the "oldest" active thread first entered the epoch section more recently than it did. That allows epoch_call() functions to run sooner than they should, and allows epoch_wait() to return before all threads currently in a corresponding epoch section have exited that section.

Correct this error by not advancing the per-CPU epoch value in ck_epoch_poll_deferred().
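
In terms of the hypothetical model sketched above, the correction amounts to something like the following; this is purely illustrative, and the real function, ck_epoch_poll_deferred(), is more involved.

  /* Simplified poll: true when calls queued at queued_epoch are safe to run. */
  static bool
  model_poll_deferred(int cpu, unsigned long queued_epoch)
  {
          (void)cpu;      /* the real code operates on this CPU's record */

          if (!model_may_run_deferred(queued_epoch)) {
                  /*
                   * Some CPU still has an active reader that entered at or
                   * before queued_epoch.  The pre-patch logic, expressed in
                   * this model, effectively did
                   *
                   *      records[cpu].epoch = global_epoch;
                   *
                   * here, which makes the oldest reader on this CPU appear
                   * to have entered just now, so a later poll could report
                   * "safe" while that reader was still inside its section.
                   * The fix is simply to leave the per-CPU epoch untouched
                   * and retry later.
                   */
                  return (false);
          }
          /* Every reader that could still see the old data has exited. */
          return (true);
  }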

Test Plan

I developed a kernel module to test my theory that epoch calls could fire early. It showed that they did. These code changes fix that.

I also used a DTrace script to confirm that the net_epoch_preempt global epoch value was still advancing and that epoch calls were still being run.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
No Lint Coverage
Unit
No Test Coverage
Build Status
Buildable 20113
Build 19607: arc lint + arc unit

Event Timeline

Split the commit into two parts. In this part, just fix the early epoch calls.

jtl edited the test plan for this revision. (Show Details)
jtl added a reviewer: sbahra_repnop.org.

The other half of the original diff is now in D17503.

Thanks much for this. Note that the assumption here is that a writer thread *should not* call ck_epoch_synchronize, or perform writes, while inside a read-side critical section. In other words, the active bit must be set to 0. Under that assumption, the invariant is not violated; if the caller is inside a read-side critical section, all bets are off. Is the assumption or requirement here that it be safe to call synchronize from within a read-side critical section? If so, there are a few other areas we could fix up.


I'm not sure I fully understand the question, but let me describe how this is used in the FreeBSD kernel.

We create one struct ck_epoch_record per CPU per struct ck_epoch. When a thread enters a read-side section, we block that thread from migrating off the CPU and set the active bit for that CPU's struct ck_epoch_record (if it is not already set). However, the thread is still preemptible, so another thread can preempt it and concurrently enter a read-side section, and a third thread can preempt that one and write to the protected structures while the first two are still in their read sections.

Also, we call ck_epoch_poll_deferred() approximately once every "tick" for each struct ck_epoch_record on each CPU that has pending events. So, it is also possible for the following sequence to occur on a single CPU (sketched in code after the list):

  1. Thread A writes and runs ck_epoch_call().
  2. Thread B enters a read section.
  3. The once-a-tick handler runs ck_epoch_poll_deferred().
  4. Thread B exits the read section.
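
Replayed against the hypothetical model from the summary (illustration only, not kernel code), that sequence looks roughly like this:

  static void
  model_single_cpu_sequence(void)
  {
          unsigned long queued_epoch;

          /* 1. Thread A writes and queues a deferred call. */
          queued_epoch = global_epoch;

          /* 2. Thread B enters a read section on the same CPU. */
          model_read_enter(0);

          /*
           * 3. The once-a-tick handler polls.  Correct answer: not safe
           *    yet, because Thread B is still in its section.  With the
           *    pre-patch reset of the per-CPU epoch, a subsequent poll
           *    could have run the deferred call at this point.
           */
          (void)model_poll_deferred(0, queued_epoch);

          /* 4. Thread B exits its read section; now the call may run. */
          model_read_exit(0);
          (void)model_poll_deferred(0, queued_epoch);
  }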

We don't allow the kernel to call ck_epoch_synchronize(), ck_epoch_barrier(), or ck_epoch_barrier_wait(). Instead, barriers are handled by some custom code which uses ck_epoch_synchronize_wait() and the global epoch. That code asserts (at least in one place) that the caller is not in a read section.
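
Roughly, in terms of the same hypothetical model, that barrier code does something like the sketch below. The real code uses ck_epoch_synchronize_wait() and sleeps rather than spinning, so treat this only as the shape of the idea.

  static void
  model_epoch_wait(void)
  {
          unsigned long target = global_epoch;

          /* The real code blocks or yields; a spin keeps the model short. */
          while (!model_may_run_deferred(target))
                  continue;
  }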

So, it seems like we may violate the reader/writer assumption only in calling ck_epoch_poll_deferred() while there are active readers for the same record?

Edited to add:
For barriers, we do assert that the caller is not in a read section. However, the CPU's record may still be in a read section. So, maybe we do violate the reader/writer assumption there, too.

This revision is now accepted and ready to land. Oct 10 2018, 7:15 PM

The above change should not have any impact if ck_epoch is being correctly used. For that reason, it's good to go. However, I am concerned about a larger problem.

For barriers, we do assert that the caller is not in a read section. However, the CPU's record may still be in a read section. So, maybe we do violate the reader/writer assumption there, too.

As long as the CPU record does not require re-entrance support, that's fine. The requirement is that the record that is used to issue a write-side request is not in a read-side section. We can lift this restriction, but I'll need to validate a few things. Is the kernel violating this restriction at the moment? I had trouble understanding that from the example given, because if the tick handler has a dedicated record, there is no problem and then the above change is a no-op.

If the above constraint is being violated, then I'll get back to you tonight as I'll want to do some additional validation.


The only records are CPU records. Everything running on that CPU (including the tick handler) shares that same record. So, it is entirely possible for ck_epoch_call() and ck_epoch_poll_deferred() to be called using a record which is currently in a read section.

The wait barriers are a little more nuanced. They only call into the CK code using ck_epoch_synchronize_wait() and the global epoch. Nothing uses ck_epoch_synchronize(), ck_epoch_barrier(), or ck_epoch_barrier_wait(). Therefore, CK won't directly see a call to one of these synchronization functions which has a CPU record as an argument.

Does that explain things?

Ok. I caught up with Matt offline. This doesn't solve the problem at hand. The API is not being used as intended in some areas, so we will work to address that. I would advise against merging this in as it doesn't address the underlying issues:

  • Some usage of epoch records is re-entrant: This is not supported. Supporting it would require the use of atomics, which would be expensive. We have an easy fix for that.
  • Handling of ck_epoch_call across multiple writers: If writers are globally serialized, then this is fine. There is more magic involved if you have multiple writers mutating the epoch while concurrent ck_epoch_call invocations are made outside of an active section.

    I will provide a detailed proposal tonight and add more color on the above. You caught me on a particularly bad day. We'll work to get these merged upstream as soon as possible, with sufficient tests. Both of these should be easy fixes.

@jtl What timelines are you working with? It'll help me prioritize my time on this.

This revision now requires changes to proceed. Oct 10 2018, 9:10 PM

Ok. I caught up with Matt offline. This doesn't solve the problem at hand. The API is not being used as intended in some areas, so we will work to address that. I would advise against merging this in as it doesn't address the underlying issues:

  • Some usage of epoch records is re-entrant: This is not supported. Supporting it would require the use of atomics, which would be expensive. We have an easy fix for that.

Just to be clear, I think most of the actual epoch record manipulation is done in critical sections, which are not subject to preemption. (Well, technically, hardware interrupts can preempt this code, but we shouldn't be calling into the epoch code from a hardware interrupt.) The one exception to this might be the calls to ck_epoch_synchronize_wait(). But, it is certainly the case that the same epoch record can be in a read section when a write call is made, so it is re-entrant in that sense.

(However, @mmacy knows this code better than I do, so he is the right person to consult about this.)

  • Handling of ck_epoch_call across multiple writers: If writers are globally serialized, then this is fine. There is more magic involved if you have multiple writers mutating the epoch while concurrent ck_epoch_call invocations are made outside of an active section.

    I will provide a detailed proposal tonight and add more color on the above. You caught me on a particularly bad day. We'll work to get these merged upstream as soon as possible, with sufficient tests. Both of these should be easy fixes.

Thanks! I look forward to seeing the fixes.

@jtl What timelines are you working with? It'll help me prioritize my time on this.

With my FreeBSD project hat on, we are releasing FreeBSD 12 and I'd like to see this fixed soon. That will let us see whether this fixes the unexplained panics users have reported. The sooner the code gets into the tree, the sooner we can see which problems this fixes and which require further investigation.

In my day job, we are hoping to release a new version of an OS based on FreeBSD in the next week, and I'd like to have this fixed for that release. If we have a fix by Friday, we may have a shot at getting it into our internal release. But it's much more important that the code be right than that it meet that artificial timeline.

This is actually good too! Sorry for the noise. It was an API violation to call poll on a record that is in a read section, but it seems there is no reason to have had that restriction in the first place. The API restriction has now been removed.

In cases where ck_epoch_section is used, this is also safe. Everything is synchronized with the global counter. I merged a variant upstream.

I have one more concern around ck_epoch_call usage, just validating that now with Matt.

This revision is now accepted and ready to land. Oct 14 2018, 2:06 AM

Ok, call should be fine as well! In all cases, it seems writers are synchronized and any other usage is in the context of a protected section. CI also passes (I haven't run it through SPARCv9+ or ARM, but we already have RMO and TSO coverage, and nothing in these changes is specific to ARM / SPARC).

Thanks again Jon. These are great.

Matt should be merging down soon from upstream.


Thanks for the quick turnaround on reviewing and fixing these upstream!

The new code was merged in rS339375.