x86: restore intr_bind
AbandonedPublic
Actions

Authored by emaste on Aug 14 2023, 2:21 PM.

Details

Reviewers

Summary

It broke interrupt routing on some test hardware.

This reverts commit eee6537665ae9830e831d7473dedd6b2cc81c2ea.
This reverts commit 2bb16c6352494bf7aba92be700908ddf01e0c08b.

Sponsored by:   The FreeBSD Foundation

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

emaste requested review of this revision.Aug 14 2023, 2:21 PM

emaste created this revision.

It'd be useful to know where to start looking for the root cause. We have nexus_bind_intr() -> _intr_event_bind() -> ie_assign_cpu, and I think ie_assign_cpu will be intr_assign_cpu() on amd64, and the problem is likely in msi_assign_cpu()?

eee6537665ae9830e831d7473dedd6b2cc81c2ea and 2bb16c6352494bf7aba92be700908ddf01e0c08b are correct/required; this revert "works" because it truncates to CPU<256; adding cpu &= 0xff "works" equally well.

I'm unsure whether it works well as part of any existing differential, but it was suggested there should be a cpu_t type definition. That didn't fit as part of D36901, but does seem worthwhile. That would likely address the breakage which was found.

In D41447#944002, @ehem_freebsd_m5p.com wrote:

I'm unsure whether it works well as part of any existing differential, but it was suggested there should be a cpu_t type definition. That didn't fit as part of D36901, but does seem worthwhile. That would likely address the breakage which was found.

I think we should indeed have a cpu_t definition, but it would not help the issue here.

For context, I have a system with 512 CPUs. sys/dev/nvme/nvme_ctrlr.c does:

for (i = c = n = 0; i < ctrlr->num_io_queues; i++, c += n) {
        qpair = &ctrlr->ioq[i];

        /*
         * Admin queue has ID=0. IO queues start at ID=1 -
         *  hence the 'i+1' here.
         */
        qpair->id = i + 1;
        if (ctrlr->num_io_queues > 1) {
                /* Find number of CPUs served by this queue. */
                for (n = 1; QP(ctrlr, c + n) == i; n++)
                        ;
                /* Shuffle multiple NVMe devices between CPUs. */
                qpair->cpu = c + (device_get_unit(ctrlr->dev)+n/2) % n;
                qpair->domain = pcpu_find(qpair->cpu)->pc_domain;
        } else {
                qpair->cpu = CPU_FFS(&cpuset_domain[ctrlr->domain]) - 1;
                qpair->domain = ctrlr->domain;
        }

        /*
         * For I/O queues, use the controller-wide max_xfer_size
         *  calculated in nvme_attach().
         */
        error = nvme_qpair_construct(qpair, num_entries, num_trackers,
            ctrlr);
        if (error)
                return (error);

        /*
         * Do not bother binding interrupts if we only have one I/O
         *  interrupt thread for this controller.
         */
        if (ctrlr->num_io_queues > 1)
                bus_bind_intr(ctrlr->dev, qpair->res, qpair->cpu);
}

that is, it has its own logic to choose a CPU for each qpair if more than one, and it ends up choosing CPUs with APIC ID > 255 for some number of qpairs. prior to 2bb16c6352494bf7aba92be700908ddf01e0c08b (or with my cpu &= 0xff hack) we end up binding to a cpu < 256 despite nvme's intent, and interrupts work by being handled by a CPU other than what nvme requested.

With 2bb16c635249 we no longer truncate the value and then fail in msi_map if the APIC ID > 255 (if we don't have an iommu)