Page MenuHomeFreeBSD

linuxkpi: Introduce a properly typed jiffies
Needs ReviewPublic

Authored by markj on Mon, Jan 20, 4:40 PM.
Tags
None
Referenced Files
F109170796: D48523.id149557.diff
Sat, Feb 1, 5:24 PM
Unknown Object (File)
Thu, Jan 30, 12:00 PM
Unknown Object (File)
Wed, Jan 29, 9:45 AM
Unknown Object (File)
Tue, Jan 28, 5:17 AM
Unknown Object (File)
Tue, Jan 28, 5:07 AM
Unknown Object (File)
Mon, Jan 27, 5:23 PM
Unknown Object (File)
Mon, Jan 27, 12:51 PM
Unknown Object (File)
Mon, Jan 27, 12:20 PM

Details

Reviewers
bz
manu
Group Reviewers
linuxkpi
Summary

Now that we have a long-sized tick counter, we can migrate to using
properly typed timeout parameters in various bits of the LinuxKPI.

  • Introduce a "jiffies" symbol in subr_ticks.S, declared only in the LinuxKPI as an unsigned long.
  • Remove all references to "ticks" from the LinuxKPI.
  • Convert interfaces to match Linux's type signatures where it makes sense.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 62147
Build 59031: arc lint + arc unit

Event Timeline

markj requested review of this revision.Mon, Jan 20, 4:40 PM
sys/compat/linuxkpi/common/include/linux/jiffies.h
39–40

Not sure what this comment is getting at

40

We might want to switch the order, making jififes_64 real -- AFAICT this is the case in Linux and jiffies is set to alias it (or jiffies_64 + 4) by linker script.

manu added a subscriber: manu.
manu added inline comments.
sys/compat/linuxkpi/common/include/linux/jiffies.h
39–40

The comment should probably says that it's defined in subr_ticks.S

This revision is now accepted and ready to land.Mon, Jan 20, 5:06 PM
sys/compat/linuxkpi/common/include/linux/jiffies.h
39–40

The jiffies symbol is defined in the kernel, not in linuxkpi.ko.

40

But we're not implementing jiffies_64 here, just jiffies. On 64-bit platforms they're equivalent, but I didn't implement a 64-bit tick counter for 32-bit platforms: it's more complex, 32-bit platforms are slowly going away, and jiffies_64 is not widely used.

markj marked 2 inline comments as done.

Make the comment more useful.

This revision now requires review to proceed.Mon, Jan 20, 5:13 PM
This revision is now accepted and ready to land.Mon, Jan 20, 5:23 PM

Any testing of the patch, whether it's DRM, wifi, or the infiniband stack, would be appreciated.

Any testing of the patch, whether it's DRM, wifi, or the infiniband stack, would be appreciated.

Ah, I was going to ask you if you tested drm with this :)

Any testing of the patch, whether it's DRM, wifi, or the infiniband stack, would be appreciated.

Ah, I was going to ask you if you tested drm with this :)

Not yet. I plan to, just haven't gotten around to it yet. And, I'm only set up to test i915kms at the moment. I can also test iwlwifi, but not other wifi drivers.

I'll try to do both the testing and a review in the next 72 hours if that is ok.

In D48523#1107340, @bz wrote:

I'll try to do both the testing and a review in the next 72 hours if that is ok.

Thank you, no rush at all.

sys/compat/linuxkpi/common/include/linux/jiffies.h
43

What is this constant about?

sys/compat/linuxkpi/common/include/linux/wait.h
151–153

Why these casts (old/int new/long) are needed at all?

169

Are linux jiffies unsigned?

Anyway, I suspect the cast there is nop.

sys/compat/linuxkpi/common/include/linux/jiffies.h
40

Yeah, I realized that after leaving the comment. This probably warrants an XXX comment that this is broken on 32-bit platforms.

sys/compat/linuxkpi/common/include/linux/wait.h
169

Linux jiffies are indeed unsigned

markj added inline comments.
sys/compat/linuxkpi/common/include/linux/jiffies.h
43

It's defined in Linux as well, seems to be used as a maximum value when converting jiffies to other units.

sys/compat/linuxkpi/common/include/linux/wait.h
151–153

I'm not sure if it's needed or not, but it causes negative values to be treated as in the past.

markj marked an inline comment as done.

Fix MAX_JIFFY_OFFSET, remove an uneeded cast, add an XXX comment for jiffies_64.

This revision now requires review to proceed.Fri, Jan 24, 3:48 PM

When testing the new patch, I encountered the same issue where the mce0 device disappeared,
accompanied by the following error in the dmesg log.

Dmesg:
mce0: link state changed to DOWN
mce1: link state changed to DOWN
mlx5_core: INFO: (mlx5_core0): E-Switch: cleanup
mlx5_core0: WARN: wait_func:968:(pid 89646): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
mlx5_core0: WARN: mlx5_destroy_unmap_eq:528:(pid 89646): failed to destroy a previously created eq: eqn 4
mlx5_core0: WARN: wait_func:968:(pid 89646): DEALLOC_UAR(0x803) timeout. Will cause a leak of a command resource
mlx5_core0: WARN: up_rel_func:87:(pid 89646): failed to free uar index 256
mlx5_core0: WARN: wait_func:968:(pid 89646): TEARDOWN_HCA(0x103) timeout. Will cause a leak of a command resource
mlx5_core0: ERR: mlx5_unload_one:1345:(pid 89646): tear_down_hca failed, skip cleanup
mlx5_core0: ERR: remove_one:1786:(pid 89646): mlx5_unload_one() failed, leaked 14360576 bytes
mlx5_core0: detached
pci3: <network, ethernet> at device 0.0 (no driver attached)
mlx5_core: INFO: (mlx5_core1): E-Switch: cleanup
mlx5_core1: WARN: wait_func:968:(pid 89646): DESTROY_EQ(0x302) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: mlx5_destroy_unmap_eq:528:(pid 89646): failed to destroy a previously created eq: eqn 4
mlx5_core1: WARN: wait_func:968:(pid 89646): DEALLOC_UAR(0x803) timeout. Will cause a leak of a command resource
mlx5_core1: WARN: up_rel_func:87:(pid 89646): failed to free uar index 256
mlx5_core1: WARN: wait_func:968:(pid 89646): TEARDOWN_HCA(0x103) timeout. Will cause a leak of a command resource
mlx5_core1: ERR: mlx5_unload_one:1345:(pid 89646): tear_down_hca failed, skip cleanup
mlx5_core1: ERR: remove_one:1786:(pid 89646): mlx5_unload_one() failed, leaked 14413824 bytes
mlx5_core1: detached
pci3: <network, ethernet> at device 0.1 (no driver attached)
mlx5_core0: <mlx5_core> mem 0x880000000-0x881ffffff irq 22 at device 0.0 on pci3
mlx5: Mellanox Core driver 3.7.1 (November 2021)mlx5_core0: WARN: wait_func:968:(pid 89677): ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource
mlx5_core0: ERR: mlx5_load_one:1124:(pid 89677): enable hca failed
mlx5_core0: ERR: init_one:1709:(pid 89677): mlx5_load_one failed -60
device_attach: mlx5_core0 attach returned 60
mlx5_core0: <mlx5_core> mem 0x882000000-0x883ffffff irq 23 at device 0.1 on pci3
mlx5_core0: WARN: wait_func:968:(pid 89677): ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource
mlx5_core0: ERR: mlx5_load_one:1124:(pid 89677): enable hca failed
mlx5_core0: ERR: init_one:1709:(pid 89677): mlx5_load_one failed -60
device_attach: mlx5_core0 attach returned 60

repro:
for i in {1..40}; do

echo "Iteration $i"
kldunload mlx5en
kldunload mlx5ib
sleep 1
kldload mlx5en
kldload mlx5ib
sleep 1

done

When testing the new patch, I encountered the same issue where the mce0 device disappeared,
accompanied by the following error in the dmesg log.

Thank you for testing. It seems that mlx5 (and probably mlx4) assume that jiffies is an int in some places, e.g., wait_func(). I am willing to work on converting these to unsigned long where appropriate. Presumably this would ease maintenance of shared code, but I am not the maintainer of these drivers - how would you like to handle it?

BTW, I have access to a mlx5 device and can test it locally, but would need help with mlx4.

When testing the new patch, I encountered the same issue where the mce0 device disappeared,
accompanied by the following error in the dmesg log.

Thank you for testing. It seems that mlx5 (and probably mlx4) assume that jiffies is an int in some places, e.g., wait_func(). I am willing to work on converting these to unsigned long where appropriate. Presumably this would ease maintenance of shared code, but I am not the maintainer of these drivers - how would you like to handle it?

Indeed I tried the following but it was not enough at least.

From ce2d32a56f3f73df01b30e3a6142565cbd869def Mon Sep 17 00:00:00 2001
From: Konstantin Belousov <kib@FreeBSD.org>
Date: Mon, 27 Jan 2025 14:12:05 +0200
Subject: [PATCH] mlx5: jiffies is unsigned long

---
 sys/dev/mlx5/mlx5_core/mlx5_cmd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sys/dev/mlx5/mlx5_core/mlx5_cmd.c b/sys/dev/mlx5/mlx5_core/mlx5_cmd.c
index 8ce30bc24e50..52c29c5511f2 100644
--- a/sys/dev/mlx5/mlx5_core/mlx5_cmd.c
+++ b/sys/dev/mlx5/mlx5_core/mlx5_cmd.c
@@ -247,7 +247,7 @@ static void poll_timeout(struct mlx5_cmd_work_ent *ent)
 {
 	struct mlx5_core_dev *dev = container_of(ent->cmd,
 						 struct mlx5_core_dev, cmd);
-	int poll_end = jiffies +
+	long poll_end = jiffies +
 				msecs_to_jiffies(MLX5_CMD_TIMEOUT_MSEC + 1000);
 	u8 own;
 
@@ -950,7 +950,7 @@ static const char *deliv_status_to_str(u8 status)
 
 static int wait_func(struct mlx5_core_dev *dev, struct mlx5_cmd_work_ent *ent)
 {
-	int timeout = msecs_to_jiffies(MLX5_CMD_TIMEOUT_MSEC);
+	unsigned long timeout = msecs_to_jiffies(MLX5_CMD_TIMEOUT_MSEC);
 	int err;
 
 	if (ent->polling) {
-- 
2.48.1
In D48523#1111725, @kib wrote:

When testing the new patch, I encountered the same issue where the mce0 device disappeared,
accompanied by the following error in the dmesg log.

Thank you for testing. It seems that mlx5 (and probably mlx4) assume that jiffies is an int in some places, e.g., wait_func(). I am willing to work on converting these to unsigned long where appropriate. Presumably this would ease maintenance of shared code, but I am not the maintainer of these drivers - how would you like to handle it?

Indeed I tried the following but it was not enough at least.

I think pretty much every use of msecs_to_jiffies() in mlx5 needs to be fixed. Every reference to jiffies needs to be audited too, at least.

Update round_jiffies() as well.

I think pretty much every use of msecs_to_jiffies() in mlx5 needs to be fixed. Every reference to jiffies needs to be audited too, at least.

Ok I will do the pass.