Page MenuHomeFreeBSD

net: Add page/bank fields to ifi2cre; update mce(4) to use them
Needs ReviewPublic

Authored by gallatin on Wed, Mar 18, 12:41 PM.
Tags
None
Referenced Files
F149402294: D55912.diff
Tue, Mar 24, 5:36 AM
F149350945: D55912.diff
Mon, Mar 23, 10:28 PM
F149347293: D55912.diff
Mon, Mar 23, 9:56 PM
Unknown Object (File)
Fri, Mar 20, 4:06 AM

Details

Reviewers
kib
np
Summary

Add the ability to specify page and bank to ifi2creq so that SIOCGI2C can be used
to query CMIS optics. These optics require the ability to do page/bank selection.
ifconfig support for CMIS and 400g optics will come in a follow-on review.

I've also updated mce(4) to use the new page & bank fields. Mce(4) is currently the
only in-tree driver where I have access to 400g hardware.

Test Plan

After ifconfig is patched to detect CMIS, verify 400g optics are properly handed. Eg:

# /tmp/ifconfig -vvv mce0
mce0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
	options=1def07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,NV,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,TXRTLMT,MEXTPG,TXTLS4,TXTLS6>
	ether a0:88:c2:d2:be:82
	media: Ethernet 400GBase-LR8 <full-duplex>
	status: active
	nd6 options=829<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL,STABLEADDR>
	drivername: mce0
	plugged: QSFP8X 400GBASE-DR4 (MPO 1x12 Parallel Optic)
	vendor: EQUAL OPTICS PN: MMS4X00-NS400-EO SN: EOC12506126002 DATE: 2025-03-20
	module temperature: 47.11 C voltage: 3.25 Volts
	lane 1: RX power: 1.58 mW (1.98 dBm) TX bias: 47.41 mA
	lane 2: RX power: 1.61 mW (2.06 dBm) TX bias: 47.41 mA
	lane 3: RX power: 2.00 mW (3.01 dBm) TX bias: 48.27 mA
	lane 4: RX power: 1.49 mW (1.73 dBm) TX bias: 48.27 mA

	CMIS DUMP (Lower Memory 0..127):
	 19 40 04 07 00 00 00 00 00 00 00 00 00 00 2f 1b
	 7e f1 00 00 00 00 00 00 00 00 00 00 00 00 00 ff
	 ff ff ff 00 00 41 00 28 05 00 00 00 00 00 00 00
	 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	 00 00 00 00 00 02 32 1c 44 01 32 1c 22 05 50 1c
	 44 01 4f 1c 44 01 4e 1c 22 05 4d 1c 22 05 4c 14
	 11 0f 4b 14 11 0f 00 00 00 00 00 00 00 00 00 00

	CMIS DUMP (Page 00h 128..255):
	 19 45 51 55 41 4c 20 4f 50 54 49 43 53 20 20 20
	 20 48 b0 2d 4d 4d 53 34 58 30 30 2d 4e 53 34 30
	 30 2d 45 4f 31 41 45 4f 43 31 32 35 30 36 31 32
	 36 30 30 32 20 20 32 35 30 33 32 30 20 20 00 00
	 00 00 00 00 00 00 00 00 60 20 00 0c 00 00 00 00
	 00 00 f0 00 04 00 00 00 00 00 00 00 00 00 c2 00
	 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
# /tmp/ifconfig -vvv bnxt0
bnxt0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
	options=1ce507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG,TXTLS4,TXTLS6>
	ether a0:88:c2:d2:be:82
	hwaddr d4:04:e6:78:48:d0
	media: Ethernet autoselect (400GBase-DR4 <full-duplex>)
	status: active
	nd6 options=829<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL,STABLEADDR>
	drivername: bnxt0
	plugged: QSFP+(CMIS) 400G-FR4/400GBASE-FR4 (LC)
	vendor: FLEXOPTIX PN: Q.164HG.2.C SN: F7Z5BUJ DATE: 2025-08-22
	module temperature: 71.21 C voltage: 3.26 Volts
	lane 1: RX power: 1.44 mW (1.57 dBm) TX bias: 76.30 mA
	lane 2: RX power: 1.58 mW (1.99 dBm) TX bias: 93.52 mA
	lane 3: RX power: 1.68 mW (2.26 dBm) TX bias: 79.62 mA
	lane 4: RX power: 1.71 mW (2.34 dBm) TX bias: 89.81 mA

	CMIS DUMP (Lower Memory 0..127):
	 1e 52 00 06 01 00 00 00 00 00 00 00 00 00 47 36
	 7f 27 00 00 3c 00 00 00 00 00 60 00 00 00 00 00
	 00 00 f0 00 ff 00 00 03 01 00 00 00 00 00 00 00
	 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	 00 00 00 00 00 02 4f 1d 44 01 50 1d 44 01 4b 15
	 11 0f 4c 15 11 0f 4d 00 22 05 4e 00 22 05 ff 00
	 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

	CMIS DUMP (Page 00h 128..255):
	 1e 46 4c 45 58 4f 50 54 49 58 20 20 20 20 20 20
	 20 38 86 02 51 2e 31 36 34 48 47 2e 32 2e 43 20
	 20 20 20 20 30 31 46 37 5a 35 42 55 4a 20 20 20
	 20 20 20 20 20 20 32 35 30 38 32 32 20 20 20 20
	 20 20 20 20 20 20 20 20 e0 24 00 07 00 00 00 00
	 00 00 f0 00 06 00 00 00 00 00 00 00 00 00 bd 00
	 51 2e 31 36 34 48 47 2e 32 2e 43 20 20 20 20 20
	 00 00 00 00 00 00 00 00 00 00 00 00 00 00 31 36

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

gallatin edited the test plan for this revision. (Show Details)
gallatin edited the test plan for this revision. (Show Details)
sys/net/if.h
607

I suspect that this change is not backward-compatible. Before, the garbage in the new page and bank fields was ignored, am I right?

Also, there are other drivers that ignore that field which means that passing the new request in fact reports invalid data in response.

sys/net/if.h
607

It depends on how paranoid we want to be. We have exactly 1 in-tree caller of of this API in lib/libifconfig/libifconfig_sfp.c:read_i2c, and it is careful to zero out its ifi2creq before use. We have one (closed source) caller at Netflix, and it is also careful to zero out its ifi2creq prior to issuing the SIOCGI2C ioctl. I did not audit ports.

In terms of an application setting a non-zero page or bank and getting back data from page 0 / bank 0 because the driver did not look at the page / bank fields, that seems like a driver bug. Eg, if the driver supports 400g optics, it should support the new fields. That's one reason I included Navdeep in the review, as I don't have a T7 and don't feel confident in adding support there.

In my opinion, it seems like overkill to plumb an entire SIOCGI2CEXT ioctl and IFDI_I2CEXT_REQ (for iflib) here.

sys/net/if.h
607

It depends on how paranoid we want to be. We have exactly 1 in-tree caller of of this API in lib/libifconfig/libifconfig_sfp.c:read_i2c, and it is careful to zero out its ifi2creq before use. We have one (closed source) caller at Netflix, and it is also careful to zero out its ifi2creq prior to issuing the SIOCGI2C ioctl. I did not audit ports.

Are we serious about ABI stability? How can we know what are the users of the interface?

In terms of an application setting a non-zero page or bank and getting back data from page 0 / bank 0 because the driver did not look at the page / bank fields, that seems like a driver bug. Eg, if the driver supports 400g optics, it should support the new fields. That's one reason I included Navdeep in the review, as I don't have a T7 and don't feel confident in adding support there.

It is not a driver bug to ignore the fields before the change, but suddenly becomes the bug afterward.

@kib What if we reclaimed some of the spare fields in ifi2creq to a uint16_t magic field. User sets i2creq.magic=PAGE_BANK_REQ and driver sets magic=PAGE_BANK_ACK ? That prevents the confusion you're concerned about.

sys/net/if.h
607

It depends on how paranoid we want to be. We have exactly 1 in-tree caller of of this API in lib/libifconfig/libifconfig_sfp.c:read_i2c, and it is careful to zero out its ifi2creq before use. We have one (closed source) caller at Netflix, and it is also careful to zero out its ifi2creq prior to issuing the SIOCGI2C ioctl. I did not audit ports.

Are we serious about ABI stability? How can we know what are the users of the interface?

I requested portmgr sweep for users of SIOCGI2C and I can take a look.

In terms of an application setting a non-zero page or bank and getting back data from page 0 / bank 0 because the driver did not look at the page / bank fields, that seems like a driver bug. Eg, if the driver supports 400g optics, it should support the new fields. That's one reason I included Navdeep in the review, as I don't have a T7 and don't feel confident in adding support there.

It is not a driver bug to ignore the fields before the change, but suddenly becomes the bug afterward.

Well, its a bug that's no worse than what we have today. Eg, ifconfig -vv mce0 for CX7 reports incorrect optic info, no thermals, no light levels, and missing or garbled optic pn/sn/ date info, since the current code mistakenly treats cmis as if it was qsfp. This was a bug reported to me at $work, and why I'm working on proper support for CMIS.

Eg:

c001.was001.dev# ifconfig -vvv mce0
mce0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
	options=1def07bf<RXCSUM,TXCSUM,EXTX,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,NV,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,TXRTLMT,MEXTPG,TXTLS4,TXTLS6>
	ether c4:70:bd:a3:00:d4
	media: Ethernet 400GBase-LR8 <full-duplex>
	status: active
	nd6 options=829<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL,STABLEADDR>
	drivername: mce0
	plugged: QSFP8X 1X Copper Passive (BNC/TNC)
	vendor:  PN:  SN:  DATE: 20

	SFF8472 DUMP (0xA0 0..127 range):
	 19 40 04 07 0f 00 00 00 00 00 00 00 00 00 2c 84
	 80 e8 00 00 00 00 00 00 00 00 00 00 00 00 00 ff
	 ff ff ff 00 00 41 00 28 05 00 00 00 00 00 00 00
	 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
	 00 00 00 00 00 02 32 1c 44 01 32 1c 22 05 50 1c
	 44 01 4f 1c 44 01 4e 1c 22 05 4d 1c 22 05 4c 14
	 11 0f 4b 14 11 0f 00 00 00 00 00 00 00 00 00 00

Why not simply not allocate the new number for the new ioctl. Rename current ioctl to something like SIOCGI2C_NOBANK. Then, at the ioctl level, do two things:

  • for new SIOCGI2C check that reserved bytes in the structure are zero
  • for SIOCGI2C_NOBANK, zero bank/page/padding and call into SIOCGI2C
In D55912#1279758, @kib wrote:

Why not simply not allocate the new number for the new ioctl. Rename current ioctl to something like SIOCGI2C_NOBANK. Then, at the ioctl level, do two things:

  • for new SIOCGI2C check that reserved bytes in the structure are zero
  • for SIOCGI2C_NOBANK, zero bank/page/padding and call into SIOCGI2C

I think I'd rather go the other direction, so as to avoid touching legacy ioctl users from ports. It turns out there are many from the ports sweep.

Eg, we add SIOCGI2CPB (where PB means page/bank) for new code. Drivers like mlx5 which can support page and/or bank access add a fallthru to SIOCGI2CPB to their SIOCGI2C case and zero page & bank for SIOCGI2C. Eg:

	case SIOCGI2C:
	case SIOCGI2CPB:
		ifr = (struct ifreq *)data;

		/*
		 * Copy from the user-space address ifr_data to the
		 * kernel-space address i2c
		 */
		error = copyin(ifr_data_get_ptr(ifr), &i2c, sizeof(i2c));
		if (error)
			break;

		/* Esnure page & bank are zero for legacy SIOCGI2C ioctl */
		if (command == SIOCGI2C) {
			i2c.page = i2c.bank = 0;
		}
In D55912#1279758, @kib wrote:

Why not simply not allocate the new number for the new ioctl. Rename current ioctl to something like SIOCGI2C_NOBANK. Then, at the ioctl level, do two things:

  • for new SIOCGI2C check that reserved bytes in the structure are zero
  • for SIOCGI2C_NOBANK, zero bank/page/padding and call into SIOCGI2C

I think I'd rather go the other direction, so as to avoid touching legacy ioctl users from ports. It turns out there are many from the ports sweep.

Eg, we add SIOCGI2CPB (where PB means page/bank) for new code. Drivers like mlx5 which can support page and/or bank access add a fallthru to SIOCGI2CPB to their SIOCGI2C case and zero page & bank for SIOCGI2C. Eg:

	case SIOCGI2C:
	case SIOCGI2CPB:
		ifr = (struct ifreq *)data;

		/*
		 * Copy from the user-space address ifr_data to the
		 * kernel-space address i2c
		 */
		error = copyin(ifr_data_get_ptr(ifr), &i2c, sizeof(i2c));
		if (error)
			break;

		/* Esnure page & bank are zero for legacy SIOCGI2C ioctl */
		if (command == SIOCGI2C) {
			i2c.page = i2c.bank = 0;
		}

I think your proposal is also fine, but why? What I propose

  1. provide perfect binary compat for old consumers
  2. allows to avoid modifying drivers at all, except in case when the bank/page support is added.

I think that I may not understand your proposal. When you say "at the ioctl level..", Do you mean intercept the ioctls in ifhwioctl()? I think that seems awkward because ifi2req lives in userspace, and individual drivers copy it in / out, rather than it being done in a single spot for all drivers. So to zero fields, ifhwioctl would need copy-in ifi2creq, fix it up, & copy it out again..

My proposal doesn't have this requirement, and also only requires changes to drivers supporting page/bank operation...