
ZFS ARC: remove TRIM-ing of cache (L2ARC) devices
AbandonedPublic

Authored by avg on Feb 15 2017, 1:27 PM.
Tags
None
Subscribers

Details

Reviewers
mav
smh
Summary

The devices are being over-written with new data in a circular fashion.
There is no need to worry about the freed blocks.

It might make sense to issue one giant trim to the whole cache device
at the start, but I am not sure whether that would help much.
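
For illustration only, that kind of whole-device trim could even be done from userland before the device is added as a cache vdev, e.g. with FreeBSD's DIOCGDELETE ioctl. A minimal sketch (the device path is just an example and error handling is minimal):

  /*
   * Illustrative sketch: issue one TRIM covering the whole device via
   * the DIOCGDELETE ioctl, e.g. before adding it as an L2ARC vdev.
   * /dev/ada1 is only an example path.
   */
  #include <sys/types.h>
  #include <sys/disk.h>
  #include <sys/ioctl.h>

  #include <err.h>
  #include <fcntl.h>
  #include <unistd.h>

  int
  main(void)
  {
      off_t mediasize, arg[2];
      int fd;

      fd = open("/dev/ada1", O_RDWR);     /* example device only */
      if (fd < 0)
          err(1, "open");
      if (ioctl(fd, DIOCGMEDIASIZE, &mediasize) != 0)
          err(1, "DIOCGMEDIASIZE");

      arg[0] = 0;                         /* offset: start of device */
      arg[1] = mediasize;                 /* length: whole device */
      if (ioctl(fd, DIOCGDELETE, arg) != 0)
          err(1, "DIOCGDELETE");

      close(fd);
      return (0);
  }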

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
No Lint Coverage
Unit
No Test Coverage
Build Status
Buildable 7478
Build 7639: arc lint + arc unit

Event Timeline

avg retitled this revision from to ZFS ARC: remove TRIM-ing of cache (L2ARC) devices.
avg updated this object.
avg edited the test plan for this revision. (Show Details)
avg added reviewers: smh, mav.

I have no objections. I don't know for sure whether the circular rewrite makes TRIM completely useless for SSDs, but I accept the arguments that TRIM is expensive and that the divergence from upstream OpenZFS is not good.

This could cause excessive slowdown as the capacity of the disk is reached; however, it could be argued that a better mitigation for L2ARC devices would be to use an under-provisioned slice to ensure the SSD controller always has free space to work with.

I think it would be good to get some real-world numbers for both before and after this patch.

The most likely performance regression I can think of is where L2ARC is actively used close to the capacity limit at a low cycle rate. In that case the advisory TRIMs will help keep the subsequent writes at the device's optimal performance instead of tanking.

I think there are two sides to consider:

  1. When the SSD is new, the circular rewrite pattern of L2ARC should make TRIM useless from a wear-leveling perspective, since all written data is regularly rewritten, which gives the firmware every opportunity to do its job.
  2. When the SSD is old, the firmware may have trouble finding relatively fresh blocks once all of the SSD's logical capacity has been filled, and so it has to write into slow, worn-out blocks. In such a case TRIM can give benefits if a significant portion of the L2ARC space is regularly freed. I am just not sure to what extent SSDs in that half-dead state should be counted as a target. It should not be difficult for a user to manually under-provision a worn-out SSD, if needed.

In scenario #1 the performance of TRIM is also generally good, mitigating the need to avoid doing it.

In D9611#198579, @smh wrote:

In scenario #1 the performance of TRIM is also generally good, mitigating the need to avoid doing it.

Is there such a thing as good TRIM performance? On my new Samsung 950 NVMe I had to disable TRIM because it was unusable. Though yes, the NVMe driver probably still needs aggregation of TRIM requests to get better numbers.

In D9611#198582, @mav wrote:
In D9611#198579, @smh wrote:

In scenario #1 the performance of TRIM is also generally good, mitigating the need to avoid doing it.

Is there such a thing as good TRIM performance? On my new Samsung 950 NVMe I had to disable TRIM because it was unusable. Though yes, the NVMe driver probably still needs aggregation of TRIM requests to get better numbers.

It is indeed a very mixed bag.

TRIM coalescing on NVMe should be less of an issue due to the protocol; that said, the underlying flash layout could still cause significant problems, especially if the controller doesn't support background TRIM processing.

My understanding of how L2ARC writing works is this. The code maintains a "hand" (like a clock hand) that points to a disk offset. At regular intervals, a certain amount of space in front of the hand is freed by discarding the L2 headers that point to that space, and then new buffers are written there. Then the hand is moved forward by the appropriate amount.
There is also some freeing of L2 headers when ARC headers are freed, etc. In any case, after some uptime almost the whole cache disk is usually filled with data, and the hand inevitably moves forward. So every block gets written over sooner or later. I do not see how TRIM helps in that case.
The only scenario where it makes a difference, in my humble opinion, is the one that @mav described: a worn-out disk where the "cheese holes" behind the hand can make some difference for writing new blocks at the hand. But I think that is too marginal to be important.
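
For readers less familiar with the code, a purely illustrative model of that hand (this is not the actual arc.c logic, and evict_headers()/write_buffers() are hypothetical helpers) looks roughly like this:

  /*
   * Simplified model of the L2ARC "hand" described above: the device is
   * treated as a ring, the space in front of the hand is freed, new
   * buffers are written there, and the hand advances, wrapping back to
   * the start when it reaches the end of the device.
   */
  #include <stdint.h>

  struct l2dev {
      uint64_t start;    /* first usable offset on the cache device */
      uint64_t end;      /* end of the usable region */
      uint64_t hand;     /* next write offset */
  };

  void evict_headers(struct l2dev *dev, uint64_t off, uint64_t size);
  void write_buffers(struct l2dev *dev, uint64_t off, uint64_t size);

  static void
  l2arc_write_pass(struct l2dev *dev, uint64_t size)
  {
      /* Wrap the hand if this pass would run past the end of the device. */
      if (dev->hand + size > dev->end)
          dev->hand = dev->start;

      /* Free the space in front of the hand, then overwrite it. */
      evict_headers(dev, dev->hand, size);
      write_buffers(dev, dev->hand, size);

      /* Advance the hand; every offset is eventually written over. */
      dev->hand += size;
  }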

In D9611#198616, @avg wrote:

My understanding of how L2ARC writing works is this. The code maintains a "hand" (like a clock hand) that points to a disk offset. At regular intervals, a certain amount of space in front of the hand is freed by discarding the L2 headers that point to that space, and then new buffers are written there. Then the hand is moved forward by the appropriate amount.
There is also some freeing of L2 headers when ARC headers are freed, etc. In any case, after some uptime almost the whole cache disk is usually filled with data, and the hand inevitably moves forward. So every block gets written over sooner or later. I do not see how TRIM helps in that case.
The only scenario where it makes a difference, in my humble opinion, is the one that @mav described: a worn-out disk where the "cheese holes" behind the hand can make some difference for writing new blocks at the hand. But I think that is too marginal to be important.

Yes and no, the external view of the device's capacity and the controller's view are two very different things (due to the FTL): even when writing to the same LBA as far as we're concerned, internally it is not the same underlying flash area, due to the FTL and the controller's desire to maintain wear levelling. Essentially the position of ZFS's "hand" is largely irrelevant.

That said, if the disk is always close enough to capacity that there is no spare time between the TRIM and the subsequent reuse of the space, TRIM is likely to have very little impact. If, however, it can let the controller stay far enough ahead to avoid a P/E cycle to service the write, then performance can be significantly increased.

The problem is no two devices are the same, so there is no right answer.

If anyone's unfamiliar with the internals of flash (I'm not making any assumptions ;-)), have a read of http://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/

In a perfect world TRIM should always be beneficial, but that's definitely NOT the case as we all know.

Yes and no, the external view of the device's capacity and the controller's view are two very different things (due to the FTL): even when writing to the same LBA as far as we're concerned, internally it is not the same underlying flash area, due to the FTL and the controller's desire to maintain wear levelling.

Yes, of course.

Essentially the position of ZFS's "hand" is largely irrelevant.

No, it's still relevant, because while the new data can get written to a "physically" different place, the controller now knows that the old "physical" place is free, since "logically" the data at the hand got overwritten. The latter is the same as with TRIM.

Essentially, L2ARC periodically writes to every block on the device, and that allows the controller to physically move the data around.
That's exactly the reason why I think that TRIM is not needed for L2ARC.
TRIM is useful when we don't need the data in some area but we are not going to overwrite that area, so we need a way to tell the storage system that it can reuse the physical cells without worrying about any data in them. But if we overwrite that area anyway, then the storage system is automatically aware that the data in those physical cells is obsolete. It's free to choose either those same cells or any different cells for the new data according to the wear-leveling algorithms, but that's beside the point.
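
To make that concrete, here is a toy model of an FTL mapping table (purely illustrative; no real controller is implemented exactly like this): an overwrite invalidates the old physical page just as a TRIM would, and additionally records the new mapping, so for space that is always overwritten the TRIM adds no information.

  /*
   * Toy FTL mapping model, for illustration only.  Initialization of
   * lba_to_page[] to UNMAPPED is omitted for brevity.
   */
  #include <stdbool.h>
  #include <stdint.h>

  #define NLBA      1024
  #define NPAGE     4096
  #define UNMAPPED  UINT32_MAX

  static uint32_t lba_to_page[NLBA];  /* logical block -> physical page */
  static bool page_live[NPAGE];       /* pages still holding live data */

  static void
  ftl_trim(uint32_t lba)
  {
      if (lba_to_page[lba] != UNMAPPED)
          page_live[lba_to_page[lba]] = false;  /* old page is garbage now */
      lba_to_page[lba] = UNMAPPED;
  }

  static void
  ftl_write(uint32_t lba, uint32_t new_page)
  {
      if (lba_to_page[lba] != UNMAPPED)
          page_live[lba_to_page[lba]] = false;  /* same effect as a TRIM... */
      lba_to_page[lba] = new_page;              /* ...plus the new mapping */
      page_live[new_page] = true;
  }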

In D9611#198637, @avg wrote:

Essentially, L2ARC periodically writes to every block on the device, and that allows the controller to physically move the data around.
That's exactly the reason why I think that TRIM is not needed for L2ARC.
TRIM is useful when we don't need the data in some area but we are not going to overwrite that area, so we need a way to tell the storage system that it can reuse the physical cells without worrying about any data in them. But if we overwrite that area anyway, then the storage system is automatically aware that the data in those physical cells is obsolete. It's free to choose either those same cells or any different cells for the new data according to the wear-leveling algorithms, but that's beside the point.

That's where time comes into it: if L2ARC frees blocks, waits a little, and then writes blocks, the device can do just a Program (the Erase having been done previously) instead of a full Erase + Program, which is significantly quicker.

It's not clear to me which case L2ARC would trigger.

The change is trivial, but the decision is not :-)

Indeed. Do you have any bandwidth to do proper testing to prove it either way, or have you already done this?

In D9611#201896, @smh wrote:

Indeed. Do you have any bandwidth to do proper testing to prove it either way, or have you already done this?

To be honest, I don't even know what I would test.
Do you have any suggestions?