Page MenuHomeFreeBSD

ixgbe and igb should check tunables at load time. Also support per-device queue count.
AbandonedPublic

Authored by alfred on Nov 12 2014, 7:26 AM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Jun 12, 9:55 PM
Unknown Object (File)
Wed, Jun 12, 9:55 PM
Unknown Object (File)
May 27 2024, 4:45 AM
Unknown Object (File)
May 19 2024, 9:54 AM
Unknown Object (File)
Apr 22 2024, 10:47 PM
Unknown Object (File)
Apr 13 2024, 10:16 PM
Unknown Object (File)
Mar 20 2024, 1:31 AM
Unknown Object (File)
Feb 16 2024, 8:11 AM

Details

Reviewers
erj
glebius
adrian
smh
Group Reviewers
network
Summary

We tune the number of queues for ixgbe and igb based on mp_ncpus using a kernel module.

The problem we have is that both drivers fetch the value via TUNABLE_INT() in driver global context.

Unfortunately this causes the tunable to be fetched at SI_SUB_TUNABLES which is one of the first SYSINITs run, this is BEFORE SI_SUB_CPU which is what sets mp_ncpus.

This fix causes the value for number of queues to be read at driver attach time.

In addition each device will check for "hw.igb.%d.num_queues" for each interface to set the number of queues for the NIC allowing each NIC to have different queue count.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

alfred retitled this revision from to ixgbe and igb should check tunables at load time. Also support per-device queue count..
alfred updated this object.
alfred edited the test plan for this revision. (Show Details)
alfred added reviewers: glebius, adrian, smh.
alfred set the repository for this revision to rS FreeBSD src repository - subversion.
alfred added a reviewer: network.

You talk about having to workaround mp_ncpus not being available when TUNABLE_INT is called but that doesn't seem like a problem as the values read are an override and hence don't depend on mp_ncpus being available. Am I miss-understand what you're trying to fix?

In xxx_setup_msix you're adding per device specific tunables but you haven't added the corresponding sysctl for them so they will be invisible to the user, something of a pet hate of mine ;-)

Unrelated but also I noticed while looking at this that enable_aim in ixgbe is broken, as the global and device setting reference the same variable.

bryanv added inline comments.
sys/dev/e1000/if_igb.c
2880

DragonflyBSD has device_getenv_int() that does something similar - device specific query with fallback to a global value. Maybe it is a good idea to add a similarly named function so it can be available to all drivers?

In D1149#5, @smh wrote:

You talk about having to workaround mp_ncpus not being available when TUNABLE_INT is called but that doesn't seem like a problem as the values read are an override and hence don't depend on mp_ncpus being available. Am I miss-understand what you're trying to fix?

Have a look at the following kernel module. What it does it tune number of queues based on CPUs. The problem is that TUNABLEs are SI_ORDER_TUNABLE, if you look at kernel.h you'll see that happens before SI_ORDER_CPU.

Basically my module below uses setenv(9) to set the tunable, but by then it is too late. This fix makes the driver wait until actual probe/attach time to fetch from the environment allowing my hooks to tune properly.

In xxx_setup_msix you're adding per device specific tunables but you haven't added the corresponding sysctl for them so they will be invisible to the user, something of a pet hate of mine ;-)

Understandable, I can remove them... or is there a trick to making them per-device without a lot of fuss?

Unrelated but also I noticed while looking at this that enable_aim in ixgbe is broken, as the global and device setting reference the same variable.

Does my patch break it? Or is it already broken?

In D1149#6, @bryanv wrote:

DragonflyBSD has device_getenv_int() that does something similar - device specific query with fallback to a global value. Maybe it is a good idea to add a similarly named function so it can be available to all drivers?

That would be pretty easy, code is simple:

int
device_getenv_int(device_t dev, const char *knob, int def)
{
        char env[128];

        ksnprintf(env, sizeof(env), "hw.%s.%s", device_get_nameunit(dev), knob);
        kgetenv_int(env, &def);
        return def;
}

Do we want hw.igb0.num_queues or hw.igb.0.num_queues?

I think in FreeBSD we are doing the latter (the unit number is a separate node in the mib). So for us this would be something like:

int
device_getenv_int(device_t dev, const char *knob, int def)
{
        char env[128];

        ksnprintf(env, sizeof(env), "hw.%s.%d.%s", device_get_name(dev), device_get_unit(dev), knob);
        kgetenv_int(env, &def);
        return def;
}

Or is the dfly approach correct?

In D1149#9, @alfredperlstein wrote:
In D1149#6, @bryanv wrote:

DragonflyBSD has device_getenv_int() that does something similar - device specific query with fallback to a global value. Maybe it is a good idea to add a similarly named function so it can be available to all drivers?

That would be pretty easy, code is simple:

int
device_getenv_int(device_t dev, const char *knob, int def)
{
        char env[128];

        ksnprintf(env, sizeof(env), "hw.%s.%s", device_get_nameunit(dev), knob);
        kgetenv_int(env, &def);
        return def;
}

Do we want hw.igb0.num_queues or hw.igb.0.num_queues?

I think in FreeBSD we are doing the latter (the unit number is a separate node in the mib). So for us this would be something like:

int
device_getenv_int(device_t dev, const char *knob, int def)
{
        char env[128];

        ksnprintf(env, sizeof(env), "hw.%s.%d.%s", device_get_name(dev), device_get_unit(dev), knob);
        kgetenv_int(env, &def);
        return def;
}

Or is the dfly approach correct?

The local copies of that function in the drivers I've written - vtnet_tunable_int() & vmxnet3_tunable_int() - use hw.%s.%d.%s. I think is most common in various drivers too.

In D1149#8, @alfredperlstein wrote:
In D1149#5, @smh wrote:

You talk about having to workaround mp_ncpus not being available when TUNABLE_INT is called but that doesn't seem like a problem as the values read are an override and hence don't depend on mp_ncpus being available. Am I miss-understand what you're trying to fix?

Have a look at the following kernel module. What it does it tune number of queues based on CPUs.

The problem is that TUNABLEs are SI_ORDER_TUNABLE, if you look at kernel.h you'll see that happens before SI_ORDER_CPU.

Basically my module below uses setenv(9) to set the tunable, but by then it is too late. This fix makes the driver wait until actual probe/attach time to fetch from the environment allowing my hooks to tune properly.

I'm still confused, the driver already tunes the number of queues to the number of available CPU's if you leave the tunable untouched (set to zero), during setup_msix. The tunable is only used if you want to override that auto calculation.

In xxx_setup_msix you're adding per device specific tunables but you haven't added the corresponding sysctl for them so they will be invisible to the user, something of a pet hate of mine ;-)

Understandable, I can remove them... or is there a trick to making them per-device without a lot of fuss?

See ixgbe_attach which adds device specific sysctls, but don't use ixgbe_enable_aim as an example as that's broken ;-)

Unrelated but also I noticed while looking at this that enable_aim in ixgbe is broken, as the global and device setting reference the same variable.

Does my patch break it? Or is it already broken?

Already broken, just thought I would mention it in case you used it as a basis for per device sysctls.

In D1149#9, @alfredperlstein wrote:
In D1149#6, @bryanv wrote:

DragonflyBSD has device_getenv_int() that does something similar - device specific query with fallback to a global value. Maybe it is a good idea to add a similarly named function so it can be available to all drivers?

That would be pretty easy, code is simple:

int
device_getenv_int(device_t dev, const char *knob, int def)
{
        char env[128];

        ksnprintf(env, sizeof(env), "hw.%s.%s", device_get_nameunit(dev), knob);
        kgetenv_int(env, &def);
        return def;
}

Do we want hw.igb0.num_queues or hw.igb.0.num_queues?

We currently have the former for most sysclt's with some having device specific ones too.

I think in FreeBSD we are doing the latter (the unit number is a separate node in the mib). So for us this would be something like:

int
device_getenv_int(device_t dev, const char *knob, int def)
{
        char env[128];

        ksnprintf(env, sizeof(env), "hw.%s.%d.%s", device_get_name(dev), device_get_unit(dev), knob);
        kgetenv_int(env, &def);
        return def;
}

Or is the dfly approach correct?

Already have similar via something like:

SYSCTL_ADD_PROC(device_get_sysctl_ctx(dev),
                SYSCTL_CHILDREN(device_get_sysctl_tree(dev)),
                OID_AUTO, "advertise_speed", CTLTYPE_INT | CTLFLAG_RW,
                adapter, 0, ixgbe_set_advertise, "I", "Link Speed");
In D1149#12, @smh wrote:
In D1149#8, @alfredperlstein wrote:
In D1149#5, @smh wrote:

You talk about having to workaround mp_ncpus not being available when TUNABLE_INT is called but that doesn't seem like a problem as the values read are an override and hence don't depend on mp_ncpus being available. Am I miss-understand what you're trying to fix?

Have a look at the following kernel module. What it does it tune number of queues based on CPUs.

The problem is that TUNABLEs are SI_ORDER_TUNABLE, if you look at kernel.h you'll see that happens before SI_ORDER_CPU.

Basically my module below uses setenv(9) to set the tunable, but by then it is too late. This fix makes the driver wait until actual probe/attach time to fetch from the environment allowing my hooks to tune properly.

I'm still confused, the driver already tunes the number of queues to the number of available CPU's if you leave the tunable untouched (set to zero), during setup_msix. The tunable is only used if you want to override that auto calculation.

I don't want the queues == number of cpus. I want queues to be equal to some f() that I define. Note in the example I gave it's actually 1/2 cpus.

I want to override the auto calc without having to hack the driver. I also want others to be able to do so as well.

In xxx_setup_msix you're adding per device specific tunables but you haven't added the corresponding sysctl for them so they will be invisible to the user, something of a pet hate of mine ;-)

Understandable, I can remove them... or is there a trick to making them per-device without a lot of fuss?

See ixgbe_attach which adds device specific sysctls, but don't use ixgbe_enable_aim as an example as that's broken ;-)

OK, I can try that.

Unrelated but also I noticed while looking at this that enable_aim in ixgbe is broken, as the global and device setting reference the same variable.

Does my patch break it? Or is it already broken?

Already broken, just thought I would mention it in case you used it as a basis for per device sysctls.

I can look into that next.

In D1149#13, @smh wrote:
In D1149#9, @alfredperlstein wrote:
In D1149#6, @bryanv wrote:

DragonflyBSD has device_getenv_int() that does something similar - device specific query with fallback to a global value. Maybe it is a good idea to add a similarly named function so it can be available to all drivers?

That would be pretty easy, code is simple:

int
device_getenv_int(device_t dev, const char *knob, int def)
{
        char env[128];

        ksnprintf(env, sizeof(env), "hw.%s.%s", device_get_nameunit(dev), knob);
        kgetenv_int(env, &def);
        return def;
}

Do we want hw.igb0.num_queues or hw.igb.0.num_queues?

We currently have the former for most sysclt's with some having device specific ones too.

I think in FreeBSD we are doing the latter (the unit number is a separate node in the mib). So for us this would be something like:

int
device_getenv_int(device_t dev, const char *knob, int def)
{
        char env[128];

        ksnprintf(env, sizeof(env), "hw.%s.%d.%s", device_get_name(dev), device_get_unit(dev), knob);
        kgetenv_int(env, &def);
        return def;
}

Or is the dfly approach correct?

Already have similar via something like:

SYSCTL_ADD_PROC(device_get_sysctl_ctx(dev),
                SYSCTL_CHILDREN(device_get_sysctl_tree(dev)),
                OID_AUTO, "advertise_speed", CTLTYPE_INT | CTLFLAG_RW,
                adapter, 0, ixgbe_set_advertise, "I", "Link Speed");

The helper from dfly is a ton more readable... yowch, ok maybe we can make a macro for that to clean it up?

In D1149#14, @alfredperlstein wrote:

I don't want the queues == number of cpus. I want queues to be equal to some f() that I define. Note in the example I gave it's actually 1/2 cpus.

I want to override the auto calc without having to hack the driver. I also want others to be able to do so as well.

Ahh so because your using setenv it needs to be done before the tunable is initialised which currently happens during SI_SUB_TUNABLES which is before SI_SUB_CPU hence too early as you can't calculate a value based on mp_ncpus.

Might be a silly question; why don't you use kernel_sysctlbyname instead, which should just work without any changes?

In D1149#16, @smh wrote:
In D1149#14, @alfredperlstein wrote:

I don't want the queues == number of cpus. I want queues to be equal to some f() that I define. Note in the example I gave it's actually 1/2 cpus.

I want to override the auto calc without having to hack the driver. I also want others to be able to do so as well.

Ahh so because your using setenv it needs to be done before the tunable is initialised which currently happens during SI_SUB_TUNABLES which is before SI_SUB_CPU hence too early as you can't calculate a value based on mp_ncpus.

Exactly.

Might be a silly question; why don't you use kernel_sysctlbyname instead, which should just work without any changes?

It is RDTUNE which means !CTLFLAG_WR, which means EPERM:

from kern_sysctl.c:

static int
sysctl_root(SYSCTL_HANDLER_ARGS)
{


      struct sysctl_oid *oid;
      int error, indx, lvl;

      SYSCTL_ASSERT_SLOCKED();

      error = sysctl_find_oid(arg1, arg2, &oid, &indx, req);
      if (error)
              return (error);

      if ((oid->oid_kind & CTLTYPE) == CTLTYPE_NODE) {
              /*
               * You can't call a sysctl when it's a node, but has
               * no handler.  Inform the user that it's a node.
               * The indx may or may not be the same as namelen.
               */
              if (oid->oid_handler == NULL)
                      return (EISDIR);
      }

      /* Is this sysctl writable? */
      if (req->newptr && !(oid->oid_kind & CTLFLAG_WR))
              return (EPERM);
alfred edited edge metadata.
  • Add a version of device_getenv_int from dfly.
  • Use device_getenv_int(9).
  • sysctl export num_queues for ixgbe and igb.

OK, code is updated.

I'm testing now, but the general gist is:

  1. We have sysctl nodes for per device num_queues being exported.
  2. We have device_getenv_int() now to clean things up. This api is slightly different from dfly so you can tell when it fails more easily.

Quick question @smh...

should the per-device tunable be "hw.igb.0.num_queues" or "dev.igb.0.num_queues"???

Thank you for looking into this! Can you please CC jfv on the review (even if he doesn't have a phabricator account yet)?

sys/dev/e1000/if_igb.c
497

Is this Phabricator whitespace wonkiness, or...?

2870–2871
  • style(9) braces.
  • should the device_printf be printed all the time?
2877–2878

style(9): braces

2887–2888

style(9): braces...

2919–2920

style(9): braces can be removed

sys/dev/ixgbe/ixgbe.c
106

Is there a hard tab here aligned with the other prototypes, or is it phabricator being silly with whitespace?

2525

Could this be renamed to num_queues for consistency with the sysctl?

2558

Reformat to 80 columns?

2566–2568

style(9): please remove braces

sys/kern/subr_bus.c
5030

return (ENOMEM) ?

I don't know how to do that. Please make 'jfv' an account.

I guess this has already been committed with r275136

This was reverted by Jack; but John asked Alfred what he was trying to do in this post right before the change was reverted, with no reply: https://lists.freebsd.org/pipermail/svn-src-head/2014-December/065558.html. It sounds like a less messy way to do this, unless the network driver modules can't be dynamically loaded.

Since there is another proposal on the table (see erj@'s comment) please close this review.

This is not the case.

Having to load a module to get around the fact that TUNABLES are read too early in the device drivers is not the right way to do it.

It breaks netbooting for one.

It disallows making a kernel module to set some auto tuning.

This should stay open until someone comes up with a real solution for setting tunables after loader, but before device driver initialization.

If you don't understand then that's fine, contact me offline and we can arrange a call to explain the need and why you would want to fix this.

In D1149#31, @alfredperlstein wrote:

This is not the case.

Having to load a module to get around the fact that TUNABLES are read too early in the device drivers is not the right way to do it.

It breaks netbooting for one.

It disallows making a kernel module to set some auto tuning.

This should stay open until someone comes up with a real solution for setting tunables after loader, but before device driver initialization.

If you don't understand then that's fine, contact me offline and we can arrange a call to explain the need and why you would want to fix this.

Someone is probably going to tell me this is a terrible idea, but would adding an alternate mode to the driver for this work? I.e. Let the driver do a minimal amount of setup in attach(), then have the user call sysctls to set the number of queues / descriptors / other, and then have init() actually initialize everything? If that works, then you could use the sysctl_byname() call someone mentioned earlier.

In D1149#32, @erj wrote:
In D1149#31, @alfredperlstein wrote:

This is not the case.

Having to load a module to get around the fact that TUNABLES are read too early in the device drivers is not the right way to do it.

It breaks netbooting for one.

It disallows making a kernel module to set some auto tuning.

This should stay open until someone comes up with a real solution for setting tunables after loader, but before device driver initialization.

If you don't understand then that's fine, contact me offline and we can arrange a call to explain the need and why you would want to fix this.

Someone is probably going to tell me this is a terrible idea, but would adding an alternate mode to the driver for this work? I.e. Let the driver do a minimal amount of setup in attach(), then have the user call sysctls to set the number of queues / descriptors / other, and then have init() actually initialize everything? If that works, then you could use the sysctl_byname() call someone mentioned earlier.

Eric, that is a good idea.

This would allow for what would be needed/expected and will make it easy for people get the behavior needed. We at Norse could then realize a diff reduction against FreeBSD if this were to be implemented.

Can this please be done? Would serve as a useful example to other drivers as well.

Eric, that isn't a terrible idea, but a very good one! Queue management at runtime by sysadmin is definitely a good idea.

However, we need generic API for that, not a bunch of sysctls that is unique to every driver. Designing API, putting it in your driver and into ifconfig(1), so that it would be an example for other drivers. Look for example at SIOCGI2C, that Alexander did. Now it is supported only by Intel cards, but everнthing is here in place to add support for other cards.

So at this point, is the general concensus to abandon this review?

Alfred: Are you going to update the code to do what Gleb suggests?