Page MenuHomeFreeBSD

CISS: Support >48 JBOD drives, fix SES enumeration, no panic on unplug, sysctl tunables &more verbosity...
ClosedPublic

Authored by pen_lysator.liu.se on Jun 5 2020, 6:42 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Jan 1, 11:47 PM
Unknown Object (File)
Wed, Dec 25, 7:12 PM
Unknown Object (File)
Sat, Dec 21, 12:41 AM
Unknown Object (File)
Wed, Dec 18, 10:29 AM
Unknown Object (File)
Nov 25 2024, 1:09 PM
Unknown Object (File)
Nov 24 2024, 9:28 PM
Unknown Object (File)
Nov 24 2024, 1:29 PM
Unknown Object (File)
Nov 23 2024, 3:53 AM

Details

Summary

This patch fixes the handling of many drives in JBOD (HBA) mode
The current code fails to handle more than ~48 drives per controller
depending on controller model - but for a HP H241 which supports 64
logical volumes this is true)

Also fixes the panic when removing and replugging devices while active.

And adds visible sysctl's for tunables.

Test Plan

Tested the patch with HP H241 SmartHBA cards and two HP D6020 external
SAS HBA cabinets with 70+70 12TB drives. Also tried it on an older HP
server with the really old HP P400 SmartArray controller card with 6 drives.
It would be nice to have someone else test it with many drives on a
SmartArray controller (in JBOD) mode.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Removed a couple of (1 ||) (to force printing of info)

Tag in the last two committers that touched it, as well.

sys/dev/ciss/ciss.c
3043 ↗(On Diff #72736)

sys/param.h provides a MAX() macro that you should use here.

sys/dev/ciss/cissvar.h
239

The name should be tabbed over, rather than spaced over, to match the lines above it.

Space fixes and remove comments and use MAX

I am not very close to ciss, only fixed couple issues there, but it looks OK to me, except the initiator_id comment and somewhat weird way to detect max target number.

sys/dev/ciss/ciss.c
3050 ↗(On Diff #72740)

0 is a wrong value here, since it may affect scanning for target 0. Something like CAM_TARGET_WILDCARD could be better, unless special value introduced..

This revision is now accepted and ready to land.Jun 5 2020, 7:40 PM
pen_lysator.liu.se marked 2 inline comments as done.

More space...

This revision now requires review to proceed.Jun 5 2020, 7:45 PM
pen_lysator.liu.se marked an inline comment as done.

More fixes (sysctl, verbose param, fix for panic-when-devices removed/replugged)

This looks decent to me with the changes. I'd test it, but I have no ciss hardware

sys/dev/ciss/ciss.c
297 ↗(On Diff #72746)

Stray new newline

4222 ↗(On Diff #72746)

I'd just remove the panic and the ifdef, keeping just the printf.

This revision is now accepted and ready to land.Jun 5 2020, 10:27 PM
pen_lysator.liu.se marked 2 inline comments as done.

Remove stray newline and remove the panic() (and #if/#endif)

This revision now requires review to proceed.Jun 6 2020, 6:53 AM

Adds a fix to allow SES to detect stuff behind the CISS controllers and more verbosity

Regarding testing the patch - I'll try to find some more/other models of HP RAID controllers and verify the functionality with them. There should be people out there using HP controllers already on smaller machines so I can try to find some people using HP RAID cards to test it too...

Added a fix for another problem that prevents SES ("sesutil map" for example) to get information about what is connected behind CISS controllers in the latest update.

pen_lysator.liu.se retitled this revision from CISS driver: Support more than 48 JBOD drives on HP SmartArray/SmartHBA cards i JBOD mode to CISS: Support >48 JBOD drives, fix SES enumeration, no panic on unplug, sysctl tunables &more verbosity....Jun 6 2020, 2:38 PM

I have rebased patch against the recent HEAD, compiled, installed and rebooted. The hardware for test was Hp Proliant Dl360 G7 with
ciss0: <HP Smart Array P410i> port 0x4000-0x40ff mem 0xfbe00000-0xfbffffff,0xfbdf0000-0xfbdf0fff irq 28 at device 0.0 on pci1
with 3 logical drives and firmware 6.64. This controller is not capable of switching to JBOD, at least when flashed with official firmware, so I was using each disk as a separate logical drive.

Unfortunately, the system wasn't able to mount / from ZFS, it was stuck looping:
ciss0: command status 0x1 (target status) scsi status 0x2 (opcode 0x1a)
ciss0: command status 0x1 (target status) scsi status 0x2 (opcode 0x1a)
ciss0: command status 0x1 (target status) scsi status 0x2 (opcode 0x1a)
ciss0: command status 0x1 (target status) scsi status 0x2 (opcode 0x1a)
Root mount waiting for: CAM

I have rebased patch against the recent HEAD, compiled, installed and rebooted. The hardware for test was Hp Proliant Dl360 G7 with
ciss0: <HP Smart Array P410i> port 0x4000-0x40ff mem 0xfbe00000-0xfbffffff,0xfbdf0000-0xfbdf0fff irq 28 at device 0.0 on pci1
with 3 logical drives and firmware 6.64. This controller is not capable of switching to JBOD, at least when flashed with official firmware, so I was using each disk as a separate logical drive.

Unfortunately, the system wasn't able to mount / from ZFS, it was stuck looping:
ciss0: command status 0x1 (target status) scsi status 0x2 (opcode 0x1a)
ciss0: command status 0x1 (target status) scsi status 0x2 (opcode 0x1a)
ciss0: command status 0x1 (target status) scsi status 0x2 (opcode 0x1a)
ciss0: command status 0x1 (target status) scsi status 0x2 (opcode 0x1a)
Root mount waiting for: CAM

So opcode 0x1a is 'MODE SENSE(6)' which should generally work... but these messages are saying that the target didn't report status.

I have a while to take another look into this promising patch and found out that the culprit of not mounting the root filesystem was the setting: hw.ciss.expose_hidden_physical=1 in loader.conf, which really did nothing before* the patch but also was harmless. After removing this line from loader.conf I booted successfully in normal and verbose mode. Since I am not able to update the review, I paste links to the patch. I skipped changes in lines 1287-1289 and had to change a bit 2345-2346 to rebase on the recent main.
https://zarychtam.pb.pwste.edu.pl/diffs/D25155_rebased.diff
https://zarychtam.pb.pwste.edu.pl/diffs/D25155_rebased_full.diff
Perhaps the author can look into this and fix, add 1287-1289 if relevant and update the review. It's still probably possible to reach the main before stable/14 gets branched.

  • - It really did something, hw.ciss.expose_hidden_physical=1 was exposing physical drives as pass(4) devices which for example allowed to check their health with sysutils/smartmontools.
kaktus added inline comments.
sys/dev/ciss/ciss.c
247 ↗(On Diff #72777)

CTLFLAG_RD | CTLFLAG_MPSAFE

I tried to get rid of problem with hidden devices and applied the change suggested by kaktus. Updated diff works for me.

https://zarychtam.pb.pwste.edu.pl/diffs/D25155_rebased_full.diff

sys/dev/ciss/ciss.c
1698–1699 ↗(On Diff #72777)

Replacing this line with:

if ((target > sc->ciss_max_physical_target) &&
    (cll->lun[i].physical.mode != CISS_HDR_ADDRESS_MODE_MASK_PERIPHERAL))

fixes root mount while booting with ciss_expose_hidden_physical="1" and makes hidden pass(4) devices visible.

I tried to get rid of problem with hidden devices and applied the change suggested by kaktus. Updated diff works for me.

https://zarychtam.pb.pwste.edu.pl/diffs/D25155_rebased_full.diff

fetch: https://zarychtam.pb.pwste.edu.pl/diffs/D25155_rebased_full.diff: Operation timed out

Any chance you can send me this via email? imp@freebsd.org

I also committed the typo... I may commit the ciss_verbose as well separately.

The link was temporarily disabled due to problems with the Apache where I was testing things related to another PR 268318. It should work now. I am sorry for the inconvenience.

sys/dev/ciss/ciss.c
1648 ↗(On Diff #72777)

Prepending two spaces before " max" will make formatting consistent with other ciss_printf commands above.

sys/dev/ciss/ciss.c
1648 ↗(On Diff #72777)

I was wrong, this part looks fine in the place where it's displayed. Please leave it as is. Please let me apologise for the wasteful comment I made above.

The rebased patch applies clearly on stable/13, there were also no problems running patched stable/13. I hope it will make it into the head before stable/14 is branched. Whilst additional ciss(4) verbosity could be helpful, fixing the enumeration problem for a larger number of disks is critical.

I am still testing this solution 24/7. So far I had no issues.
The patch rebased against recent CURRENT is hosted here:
https://zarychtam.pb.pwste.edu.pl/diffs/D25155_rebased_full.diffs

I am still testing this solution 24/7. So far I had no issues.
The patch rebased against recent CURRENT is hosted here:
https://zarychtam.pb.pwste.edu.pl/diffs/D25155_rebased_full.diffs

Hmmm, no it isn't... Is there a reason?

In D25155#927401, @imp wrote:

I am still testing this solution 24/7. So far I had no issues.
The patch rebased against recent CURRENT is hosted here:
https://zarychtam.pb.pwste.edu.pl/diffs/D25155_rebased_full.diffs

Hmmm, no it isn't... Is there a reason?

The current patch doesn't apply... Let's try to get this done this go around.

In D25155#927416, @imp wrote:
In D25155#927401, @imp wrote:

I am still testing this solution 24/7. So far I had no issues.
The patch rebased against recent CURRENT is hosted here:
https://zarychtam.pb.pwste.edu.pl/diffs/D25155_rebased_full.diffs

Hmmm, no it isn't... Is there a reason?

The current patch doesn't apply... Let's try to get this done this go around.

I am sorry, the situation with this hosting is dynamic. Please try this one, it applies clearly:
https://zarychtam.pb.pwste.edu.pl/diffs/D25155_rebased.diff

The patch still applies and works great on recent 15-CURRENT.

I will keep you updated. It still applies, builds and runs fine on CURRENT and stable/14. The patch in the link below is rebased against recent CURRENT
https://zarychtam.pb.pwste.edu.pl/diffs/D25155_rebased.diff