SMP support for ARMv6/v7 HW watchpoints
ClosedPublic

Authored by zbb on Dec 1 2015, 7:26 PM.

Details

Summary

Use a per-CPU structure to store the HW watchpoint register state
for each CPU present in the system. Those registers will be restored
upon wake-up from the STOP state if requested by the debug_monitor
code. The method is similar to the one introduced on AMD64.

We store all 16 possible HW watchpoint registers
(the maximum allowed by the architecture).
HW breakpoints are not maintained since they are used for single
stepping only.
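
A minimal sketch of that layout and of the restore path, for illustration only; the field, constant, and helper names below (other than pc_dbreg_cmd, which appears in the review comments) are assumptions, not the committed interface:

  /* In-kernel sketch; relies on <sys/param.h> and <sys/pcpu.h>. */
  #define DBG_WATCHPOINT_NUM      16      /* architectural maximum */

  struct dbreg_snapshot {
          uint32_t dbg_wvr[DBG_WATCHPOINT_NUM];   /* watchpoint value registers */
          uint32_t dbg_wcr[DBG_WATCHPOINT_NUM];   /* watchpoint control registers */
  };

  /*
   * Run by a CPU when it wakes up from the STOP state: if the debug_monitor
   * code requested it, reload the saved watchpoint registers into the
   * hardware.  dbg_load_state() is a hypothetical stand-in for the CP14
   * register writes.
   */
  static void
  dbg_resume_dbreg(void)
  {
          if (PCPU_GET(dbreg_cmd) == PC_DBREG_CMD_LOAD) {
                  dbg_load_state(PCPU_PTR(dbreg));
                  PCPU_SET(dbreg_cmd, PC_DBREG_CMD_NONE);
          }
  }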

Pointed out by: kib
Reviewed by:
Submitted by: Zbigniew Bodek <zbb@semihalf.com>
Obtained from: Semihalf
Sponsored by: Juniper Networks Inc.
Differential Revision:

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

zbb retitled this revision to SMP support for ARMv6/v7 HW watchpoints.
zbb updated this object.
zbb edited the test plan for this revision. (Show Details)
zbb added reviewers: kib, ian, imp, skra, wma_semihalf.com.
zbb set the repository for this revision to rS FreeBSD src repository - subversion.
zbb added a subscriber: ARM.
sys/arm/arm/debug_monitor.c
709 ↗(On Diff #10647)

What is the purpose of this barrier?

731 ↗(On Diff #10647)

This is better expressed with atomic_thread_fence_rel().

779 ↗(On Diff #10647)

Same questions as above.

1025 ↗(On Diff #10647)

As I understand it, this barrier complements the _rel() in the bp setup code, and would be better expressed as atomic_thread_fence_acq().

@kib
After some more investigation, I think that those barriers should be removed and replaced by a dsb() barrier at the end of:

dbg_setup_xpoint()

and

dbg_remove_xpoint()

This is, of course, assuming that only one CPU executes DDB code at a time and that the other CPUs spin while waiting to be released.
In that case we only need to ensure that all modifications to the shared variable are complete before we kick the secondary CPUs.
Do you find that reasonable?
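
For illustration, the placement proposed here, with the function body elided; the signature is approximate and dsb() is assumed to be the ARM barrier wrapper from the arm headers:

  static int
  dbg_setup_xpoint(struct dbg_wb_conf *conf)
  {
          /* ... pick a free slot and fill in the shared dbg_wvr/dbg_wcr state ... */

          /*
           * Proposed: a single dsb() here, so that every store above has
           * completed before the CPU that owns DDB later releases the
           * secondary CPUs from their spin loop.  dbg_remove_xpoint() would
           * end the same way.
           */
          dsb();
          return (0);
  }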

In D4338#91815, @zbb wrote:

@kib
After some more investigation, I think that those barriers should be removed and replaced by a dsb() barrier at the end of:

dbg_setup_xpoint()

and

dbg_remove_xpoint()

This is, of course, assuming that only one CPU executes DDB code at a time and that the other CPUs spin while waiting to be released.
In that case we only need to ensure that all modifications to the shared variable are complete before we kick the secondary CPUs.
Do you find that reasonable?

I do not quite see why that would be enough. I am looking at the IPI_STOP case of ipi_handler(). There, we loop waiting for the flag to be set which enables us to run. There is nothing that would prevent reordering of the accesses (the watchpoint data snapshot vs. the run flag in started_cpus), either on the write side (on the CPU which entered DDB) or on the read side (on a CPU which is spinning).

In other words, I think that the places where I suggest using atomic_thread_fences do need barriers, while I still do not see a purpose for the rmb()s on the write side.
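
A condensed sketch of the handshake described above, simplified rather than copied from the FreeBSD sources; without ordering on both sides, either pair of accesses may be observed out of order:

  /* CPU that entered DDB (write side): publish the snapshot, then the flag. */
  pcpup->pc_dbreg[i] = wvr;                       /* watchpoint data snapshot */
  pcpup->pc_dbreg_cmd = PC_DBREG_CMD_LOAD;
  CPU_SET_ATOMIC(cpu, &started_cpus);             /* the flag the others spin on */

  /* Stopped CPU (read side), in the IPI_STOP path of ipi_handler(). */
  while (!CPU_ISSET(PCPU_GET(cpuid), &started_cpus))
          cpu_spinwait();
  if (PCPU_GET(dbreg_cmd) == PC_DBREG_CMD_LOAD)   /* may still observe stale data */
          dbg_resume_dbreg();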

In D4338#92041, @kib wrote:

I do not quite see why that would be enough. I am looking at the IPI_STOP case of ipi_handler(). There, we loop waiting for the flag to be set which enables us to run. There is nothing that would prevent reordering of the accesses (the watchpoint data snapshot vs. the run flag in started_cpus), either on the write side (on the CPU which entered DDB) or on the read side (on a CPU which is spinning).

In other words, I think that the places where I suggest using atomic_thread_fences do need barriers, while I still do not see a purpose for the rmb()s on the write side.

  1. CPUx enters DDB and puts the other CPUs into a spin.
  2. CPUx does something, for example sets a watchpoint (for itself and the other CPUs).
  3. CPUx invokes a DSB, which ensures that all writes are completed before further instructions are issued.
  4. CPUx sets the flag that releases the secondary CPUs from the spin.

If we assume that only one CPU is online while the others spin, and that all of them are in the inner shareable domain, I don't see the need for any barrier other than a DSB before waking the secondary CPUs, i.e. a DSB after modifying the shared data, as the last operation DDB performs before waking the CPUs.

In D4338#93177, @zbb wrote:
  1. CPUx enters DDB and puts the other CPUs into a spin.
  2. CPUx does something, for example sets a watchpoint (for itself and the other CPUs).
  3. CPUx invokes a DSB, which ensures that all writes are completed before further instructions are issued.
  4. CPUx sets the flag that releases the secondary CPUs from the spin.

If we assume that only one CPU is online while the others spin, and that all of them are in the inner shareable domain, I don't see the need for any barrier other than a DSB before waking the secondary CPUs, i.e. a DSB after modifying the shared data, as the last operation DDB performs before waking the CPUs.

What you describe is exactly the classic 'message passing' litmus test. See section 3 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf and D7.2.2 "Message passing" in the ARMv7 Architecture Reference Manual.
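
For reference, the MP shape reduced to two shared words, both initially zero; on ARMv7 the reader is allowed to observe the flag as set and still read stale data unless both barriers are present:

  volatile uint32_t data = 0, flag = 0;   /* shared between the two CPUs */
  uint32_t r2;

  /* Writer (the CPU driving DDB). */
  data = 1;
  dmb();          /* or atomic_thread_fence_rel(); without it the stores may be reordered */
  flag = 1;

  /* Reader (a spinning CPU). */
  while (flag == 0)
          ;
  dmb();          /* or atomic_thread_fence_acq(); without it the loads may be reordered */
  r2 = data;      /* only now guaranteed to be 1 */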

zbb edited edge metadata.
zbb removed rS FreeBSD src repository - subversion as the repository for this revision.
In D4338#93181, @kib wrote:

What you describe is exactly the classic 'message passing' litmus test. See section 3 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf and D7.2.2 "Message passing" in the ARMv7 Architecture Reference Manual.

Thanks for the document!
Sorry for the huge delay; I completely forgot about that one.

If there are no further remarks, I will commit it today.

sys/arm/arm/debug_monitor.c
809 ↗(On Diff #12695)

Hmm, I think what you want there is to ensure that the pc_dbreg_cmd write is seen only after the dbg_w* writes. In other words, a fence_rel() should sit between these writes, or use a store_rel() for the write to pc_dbreg_cmd.

1027 ↗(On Diff #12695)

And there, a fence_acq() should come between the read of dbreg_cmd and the further accesses, or use a load_acq() to read dbreg_cmd.
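
A sketch of the pairing suggested in these two comments, using the atomic(9) primitives; the surrounding code is condensed and the field layout is assumed from the discussion, not copied from the diff:

  /* Write side: run on the CPU driving DDB, once per target CPU. */
  pcpup->pc_dbreg[i] = wvr;                       /* the dbg_w* payload */
  atomic_thread_fence_rel();                      /* payload must be visible before the command */
  pcpup->pc_dbreg_cmd = PC_DBREG_CMD_LOAD;        /* or atomic_store_rel_int() on the command */

  /* Read side: run on a CPU resuming from IPI_STOP. */
  if (atomic_load_acq_int(&pcpup->pc_dbreg_cmd) == PC_DBREG_CMD_LOAD) {
          /* The acquire orders the payload reads below after the command read. */
          dbg_load_state(pcpup);                  /* hypothetical helper */
          pcpup->pc_dbreg_cmd = PC_DBREG_CMD_NONE;
  }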

zbb marked 2 inline comments as done. Jan 26 2016, 2:20 PM
zbb added inline comments.
sys/arm/arm/debug_monitor.c
809 ↗(On Diff #12695)

I see it as follows:

  1. Update PCPU data for each CPU (including pc_dbreg_cmd)
  2. DMB (ensures that all explicit memory accesses that appear in program order before the DMB instruction are observed before any explicit memory accesses that appear in program order after the DMB instruction)
  3. Wake CPUs from spin (update started_cpus)

There should be no race between 1. and 3., since the CPUs spin on started_cpus, which is not going to be updated before everything from 1. is observable.
So if we add a DMB between the writes to d->dbg_w* and pcpu->pc_dbreg_cmd, and another DMB after the whole thing, the result will be the same, but we will have one more barrier on each loop iteration.

wma added inline comments.
sys/arm/arm/debug_monitor.c
809 ↗(On Diff #12695)

I don't see why this is necessary, and I suggest avoiding unnecessary barriers.

All this code executes in KDB, which means that only one core is active and all the others are spinning on a variable.
In this case we've got:

THREAD1:

  1. set up the pcpu/watchpoint db_reg and data_reg
  2. invoke a write barrier (all writes are finished before going further)
  3. wake up the other CPUs by modifying the spin variable

THREAD2

  1. spinning on the variable
  2. waking up
  3. calling a read barrier (all preloaded reads are discarded from the buffers)
  4. executing the command from db_reg and data_reg

I just don't understand what can go wrong here; we only need to guarantee that cmd_reg and data_reg are written before waking up the other CPUs and are observable by the just-woken-up thread. Any particular order of the writes/reads to cmd/data reg is irrelevant.

wma added a reviewer: wma.
This revision is now accepted and ready to land. Jan 28 2016, 12:01 PM
This revision was automatically updated to reflect the committed changes.

Committed.
@kib I hope you don't mind that I noted "no strong objections" from you.