
Correct adaptation ZFS ARC memory pressure to FreeBSD
Needs Review · Public

Authored by slw_zxy.spb.ru on Aug 16 2016, 9:59 PM.

Details

Summary
  1. On illumos, needfree is part of the VM subsystem, which ZFS just observes.

FreeBSD emulates this, but the current emulation is ugly: on a pressure event, needfree is set to btoc(arc_c >> arc_shrink_shift) and is not changed until arc_size drops below arc_c.
While the ARC is being reclaimed, arc_c keeps dropping toward arc_c_min, triggering more and more reclaim whenever arc_size is reclaimed quickly enough.
As a result the ZFS ARC can be dramatically reduced, all the way down to arc_c_min.
I have seen 10x to 100x more ARC reclaim than needed.

  1. arc_shrink() leaves the freed memory in the zone caches, where it is not available to the system. Call arc_kmem_reap_now() to release it.
  2. arc_kmem_reap_now() does not clean up the per-CPU zone caches. Implement uma_reclaim_zone_cache() and arc_drain_uma_cache() for this.
  3. A lot of free memory that is immediately available to the ARC can be found in the zone caches; check for it when calling arc_reclaim_needed() to eliminate false memory pressure on the ARC.

Many thanks to markj, avg and mav for help, support and clarifications!

Tests and feedback welcome!

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline


Illumos updates needfree from the kmem subsystem. FreeBSD emulates this in the following way:

  1. Set a new target ARC size in arc_lowmem(): arc_target_size = arc_size - needfree * PAGESIZE
  2. Correct needfree in arc_available_memory(): needfree = btoc(arc_size - arc_target_size)
  3. Evict less aggressively in arc_reclaim_thread(): arc_shrink() by up to (arc_c >> arc_shrink_shift), and only when some memory is free (the current version, under memory pressure, shrinks by needfree plus (arc_c >> arc_shrink_shift))
allanjude added a subscriber: allanjude.
allanjude added inline comments.
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
5446 ↗(On Diff #25108)

Are we sure it is safe to remove this comment and accompanying code?

slw_zxy.spb.ru added inline comments. Feb 13 2017, 8:47 PM
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
5446 ↗(On Diff #25108)

I consulted with avg and yes, I am sure.
This code currently sends all pressure to the ARC, without any chance to
reclaim anything from inactive memory.
Removing this code allows reclaiming from inactive memory and from
other consumers.

avg edited edge metadata. Feb 14 2017, 1:24 PM
  • I don't think that the needfree calculations in the patch are correct
    • in illumos needfree means something entirely different from what's calculated in the patch
    • there needs to be a very good explanation of why we would want to calculate needfree in the proposed fashion
  • I have no objections against the part of the patch which comes from illumos, but then we should just import that change (7504 kmem_reap hangs spa_sync and administrative tasks)
  • I am in favor of removing cv_wait from the ARC lowmem hook

On the last point. I haven't seen that problem reported recently, but in the past there were cases where the page daemon would invoke the lowmem hook, which would wake up the ARC reclaim thread and wait for it to signal back, the reclaim thread would try to make some evictions and in the process it would either ask for more memory or get stuck on a lock, which was held by another thread that tried to make a memory allocation while holding the lock (there are a few such places in the ARC code)... And the memory allocation would block forever because there is no agent that would free / reclaim any memory.
In general, because the ZFS code uses SX locks (sleepable), I am in favour of reducing the number of cases where the page daemon has to wait on the ZFS code.

In D7538#198085, @avg wrote:
  • I don't think that the needfree calculations in the patch are correct
    • in illumos needfree means something entirely different from what's calculated in the patch
    • there needs to be a very good explanation of why we would want to calculate needfree in the proposed fashion

Yes, the needfree calculation on illumos is different. I think that is unimportant here: needfree means the amount of memory to release from the ARC. FreeBSD is different from illumos, and we can use a different calculation. In this patch the target is v_free_target under memory pressure and zfs_arc_free_target + v_free_reserved in normal operation. This is not exactly the illumos behaviour, but it is similar.

A big difference: the ARC can hold a lot of free memory as free zone objects. Furthermore:

  1. UMA has no kernel interface for querying these counts.
  2. Computing them directly in arc.c would be incorrect and would leak UMA internals.
  3. They must be accounted only from the arc_adjust path during normal operation.
  4. They must not be accounted under memory pressure.
  5. The calculation may be expensive (many zones; every zone needs a lock and a loop over the per-CPU caches).
  • there needs to be a very good explanation of why we would want to calculate needfree in the proposed fashion

main goal: don't allow freemem to drop below v_free_min

avg added a comment. Feb 14 2017, 2:42 PM
  • there needs to be a very good explanation of why we would want to calculate needfree in the proposed fashion

main goal: don't allow freemem to drop below v_free_min

I think that the existing check handles that already.

In D7538#198159, @avg wrote:
  • there needs to be a very good explanation of why we would want to calculate needfree in the proposed fashion

main goal: don't allow freemem to drop below v_free_min

I think that the existing check handles that already.

The existing check allows dropping the entire ARC cache under memory pressure: after arc_lowmem fires, needfree is set to a non-zero value and is not adjusted until (arc_size <= arc_c || evicted == 0).
This check runs after the call to arc_adjust(), and in some cases arc_size > arc_c after every call. In those cases the ARC shrinks to arc_c_min. Yes, I have seen this behaviour.

The existing check is totally wrong and unacceptable.

avg added a comment. Feb 26 2017, 5:54 PM

If anyone is curious, here is a real world example of a deadlock involving the page daemon sleeping in ARC vm_lowmem hook.

48240 102111 aiod27         aiod27               stack8 sched_switch+0x143 mi_switch+0x194 sleepq_wait+0x42 _sleep+0x3c5 vm_wait+0x75 uma_small_alloc+0x6a keg_alloc_slab+0x9b keg_fetch_slab+0xdf zone_fetch_slab+0x42 zone_import+0x34 zone_alloc_item+0x3d keg_alloc_slab+0x2ba keg_fetch_slab+0xdf zone_fetch_slab+0x42 zone_import+0x34 uma_zalloc_arg+0x2fc arc_get_data_buf+0x489 arc_buf_alloc+0xc6 arc_read+0x121 dbuf_read+0x5a7 dbuf_findbp+0x183 dbuf_hold_impl+0x186 dbuf_hold+0x1b dmu_buf_hold_array_by_dnode_line+0x21b dmu_buf_hold_array+0x59 dmu_read_uio+0x3f zfs_freebsd_read+0x6d9 VOP_READ_APV+0x6e vn_read+0xca vn_io_fault+0x159 aio_daemon+0x991 fork_exit+0x11f fork_trampoline+0xe
   37 100626 zfskern        arc_reclaim_thread   stack41 sched_switch+0x143 mi_switch+0x194 sleepq_wait+0x42 _sx_xlock_hard+0x56f _sx_xlock+0x7e arc_buf_remove_ref+0x8c dbuf_rele_and_unlock+0x10e dbuf_evict+0x11 dbuf_do_evict+0x56 arc_do_user_evicts+0xb4 arc_reclaim_thread+0x222 fork_exit+0x11f fork_trampoline+0xe
    7 100113 pagedaemon     pagedaemon           stack49 sched_switch+0x143 mi_switch+0x194 sleepq_wait+0x42 _sx_xlock_hard+0x56f _sx_xlock+0x7e arc_lowmem+0x3c vm_pageout+0x20b fork_exit+0x11f fork_trampoline+0x

This is with an older code base, but I believe that exactly the same can happen with the latest code too.

slw_zxy.spb.ru retitled this revision from Correct adaptation of needfree (from illumos) to FreeBSD (ZFS ARC memory pressure) to Correct adaptation ZFS ARC memory pressure to FreeBSD.
slw_zxy.spb.ru updated this object.
slw_zxy.spb.ru edited edge metadata.
  1. Check the size of free items in zones (this eliminates false pressure on the ARC from the per-CPU zone caches)
  2. Clean up the per-CPU zone caches after arc_kmem_reap_now() (memory shrunk out of the ARC becomes immediately available as free memory)

fix typo again

per-CPU cache must be drained before zone drain

Optimize zone processing

Generalize the zone traversal algorithm and optimize cache reclaim (arc_kmem_reap_now() is costly and is called only as a last resort)

Fix userland compilation

  1. Restore lost code
  2. Control (by sysctl) the draining of the per-CPU UMA caches
  3. Skip the first UMA cache drain when the caches already hold enough memory.
  4. Don't wait for drained memory to be accounted as free. Stop draining when the expected amount is reached.

Somehow this patch solved my old problem: instant hangs somewhere in building libclang/libllvm during buildworld.
My configuration: 11-STABLE amd64 on a VMware VM, ZFS (default layout from bsdinstall), 4 or 6 GB of memory (no visible difference; no explicit prefetch on the 4 GB setup), 4-core parallel build.
top usually reports something like "2G free" (until the entire VM becomes unresponsive).

Note that building world on this VM nowadays is not a disk- or memory-intensive task at all; it is limited solely by CPU.

As some of you probably know I've been chasing this same general issue here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

I'm playing with this patch set now on 11.1-STABLE (r324056) and other than leaving a crazy amount of inactive pages outstanding (which never get reclaimed in many instances, thus pressuring ARC size to half or so of what it could otherwise be) it appears to behave well.

I think I've got a fix for that last issue, and this patch set is a more-elegant approach to the UMA bloat problem than I had come up with. I want to run my changes to this code for a few days before contributing my thoughts in the form of code, but the short version is that adding a pager wakeup somewhat above the low-memory threshold appears to resolve the "frozen" inactive page issue and, if that proves up, this patch set looks very good and somewhat-superior to the one I have been running for a while (and thus a better option.)

Good news. I think you are working on the other side of the big problem: pressure on the ARC. My patch is about the ARC's reaction to pressure (the current reaction is totally bad). I guess the best thing is to apply both patches. Can you check this?

Definitely; I have the patch here along with additional changes derived from my previous work running a soak test; should have some commentary and perhaps additional suggestions in a few days.

Being new to phabricator.... how do I upload a diff that is off a (probably) different base rev and has my changes in it (I don't really want to "update" the existing diff, or do I?)

It would be better handled by slw_zxy.spb.ru (for consistency with original diff).

slw_zxy.spb.ru, can you handle it if Karl uploads an updated patch (for stable/11) to Bugzilla (already noted by Karl)?
If Karl's updates are arc.c-specific, I think the differences would be relatively small.

Comparing my latest (not yet uploaded to Bugzilla due to insufficient testing) minimally-fixed version of Karl's patches for head (r320262)
and stable/11 (r321610), the difference was just the position of the abd_chunk_cache declaration.

Anyway, although I haven't been through it myself yet, the regular procedure is described on the page below, which requires a local head src tree. [1]

[1] https://wiki.freebsd.org/Phabricator

Confirmed that the latest patch Karl uploaded on Bugzilla bug 187594 [1] (merged his patch with the one here) is...

* Applicable to head at r324080: cleanly to arc.c, and with some offsets to the VM portion.
* Builds / installs fine on amd64.
* Boots normally.

Not yet stress-tested, though.

slw_zxy.spb.ru, can you try the latest one, and if it is OK, update the diff here?
Beware! Karl's patch must be applied in /usr/src/sys, not in /usr/src.

[1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

devnull_e-moe.ru added a comment. Edited Oct 4 2017, 9:24 AM

"works for me"
(survived make buildworld two times in a row ;-) a kernel from clean sources hangs, as usual, somewhere in stage3/clang)

emaste added a subscriber: emaste. Nov 6 2017, 5:07 PM

Just a heads-up.
r325851 broke this (including Karl's updated patch at bug 187594).
"needfree" in arc.c is gone.

Well that's a bitch.

It's easily fixed but I'm not going to be doing much with -HEAD until it
stabilizes and gets closer to being -RELEASE.

Karl: I know you have been chasing this for a long time, I am sorry we have not been more responsive. The needfree change is part of this effort, it ensures that the ARC doesn't keep shrinking for no reason and end up too small when there is lots of free memory.

If you could get your patch to apply to head again, I've gotten one of our VM experts to agree to review it (I twisted his arm in person at a conference last weekend)

I'll try to get some time allocated to that - right now the machine that
I use for this effort is busy doing other things but I may be able to
free up some time on it in the next week or two.

Is there any intent to backport/MFC those changes to 11-STABLE? That
I can probably test, and adapt the code to, more quickly.

The needfree change is a good one, it's just that having it confined to
-HEAD causes problems for me right now due to "free computer of
appropriate configuration" constraints.

Thanks in advance.

I would expect the latest ZFS code to be merged from -current to stable/11 in time for 11.2-RELEASE

cy added a subscriber: cy. Dec 2 2017, 6:00 AM


A heads-up, just a bit too late:
r325851 was MFC'ed to stable/11 as r326619.
So the patch here is no longer applicable to stable/11 either. :-(

devnull_e-moe.ru added a comment. Edited Dec 19 2017, 10:56 AM

Just for whoever it may concern: I've rolled back r326619 and reapplied slw's patch on top; again, on a "works for me" basis.

I personally prefer working re-implementation instead of bluntly deleting code.

This is a full diff against stable/11 r326974 with Karl's part included. It has been working on a fairly heavily overloaded VM for a week, so I decided to put it here for others' convenience.

Update to latest -STABLE changes

swills added a subscriber: swills. May 22 2018, 1:30 PM
markj added a subscriber: markj. May 22 2018, 1:56 PM
lev added a subscriber: lev. May 22 2018, 3:19 PM

It is the only solution to the live-lock problem I encounter on my server during massively parallel fast downloads.

Unified for -stable and current now

I have applied this patch to stable/11 r334820 and have been running without issue for 10 days.

In this time I have seen wired memory being released more often than previously, without any interaction needed.

I have come to realise that there is another issue related to this: the default arc_max being wired RAM that is not counted in max_wired means a default setup is allowed to wire more than the physical RAM installed.

See my comment here for more explanation.


Thanks for your feedback!
I am looking at your comment on PR 163461. 70-80% wired memory has no direct relation to system slowdown or hard resets (for example "Mem: 3107M Active, 10G Inact, 231G Wired, 42K Buf, 5802M Free" with no slowdown); those symptoms are caused by very low free memory.
In my patch I try to convert unused wired ARC memory into system-wide free memory early.


The main point in that comment is that arc_max does not take max_wired into account; together they need to add up to less than physical RAM. This is not currently enforced, which means you can wire more RAM than you physically have, and that is a situation the system cannot recover from. I am thinking that arc_max should be prevented from being set to more than (physical RAM - max_wired - padding) to prevent all RAM being wired. But it looks like there is still something else that can wire RAM: with max_wired at 5G and arc_max at 4G I had a wired count of 9.9G a couple of hours ago.

The percentages probably apply more to low-RAM systems. When I had 8G and wired got close to 6G I noticed the slowdown; this could be partly from in-use RAM getting swapped out due to the high wired amount. If I didn't respond quickly enough and it climbed over 7G, I rarely got a chance to do anything before it led to a complete lockup; in the early stages I left it for over 10 minutes before resetting, but after a while I stopped waiting to see if it would recover. This is my desktop that I sit at all day. Sometimes I could get wired > 7G two or three times in a day, and other times I could go nearly a week before it happened; I just couldn't find steps that always reproduced it.

For a few years I have been using a script that allocates a big chunk of RAM, which causes the high wired amounts to be released. Now I have 16G, and for about six months I have automated the wired monitoring and kept it below 10G.

I had been reluctant to say the wired amounts were ARC-related, as the ARC total was often reduced before the wired total when I over-allocate RAM to push it out; this may also be a matter of delay between the counters being updated. What I have realised is that the ARC is wired and added to the wired count but is not included in the max_wired test; previously I had been thinking that max_wired was simply being ignored. max_wired + arc_max needs to be kept below physical RAM to prevent all RAM being wired; I think this is the real issue that I have been trying to grasp.

koobs added a subscriber: koobs. Jul 7 2018, 3:54 AM

Hi there,

I've been using this patch for 10 days and my issues are gone, as described in the "lightly loaded system eats swap space" thread on the freebsd-stable mailing list.

Thanks.

Hi

I can confirm that the patch solves the issues described in the "lightly loaded system eats swap space" thread on the freebsd-stable mailing list.
On one of my desktop systems the issue is particularly noticeable, as the disk subsystem is rather slow; the system became unusable after a
weekend of sitting idle. I have not experienced the issues since applying the patch.

Thanks to everybody

mmacy added a reviewer: jeff. Aug 7 2018, 9:26 PM
mmacy added a subscriber: mmacy.

Sorry that this review has stalled lately. I would like to compare this patch to what's in -CURRENT, which has evolved a fair bit since the patch was updated. Once that picture is more clear, we can focus on stable/11.

I have tried to read through everything, but I will inevitably repeat some observations and miss others. Sorry in advance. Some comments on the patch:

  • needfree has been removed entirely in r325851.
  • The FMR_NEEDFREE target in -CURRENT is now the same as in this patch (though I didn't realize that when I made the change). See r332365.
  • A regression affecting 11.2 which caused the ARC to shrink very drastically was fixed in r338142. It seems this patch worked around that regression by reaping the kmem caches after each call to arc_adjust.

Given the last point, my impression is that -CURRENT already contains the bulk of this change. The main difference is that this patch also drains the per-CPU caches. However, the per-CPU bucket size is quite small for any of the zio buf zones, so I'm skeptical that that part of the change is very important. Does anyone running ZFS on -CURRENT post r338142 observe any problems that are fixed by this patch?

In D7538#358922, @markj wrote:

Sorry that this review has stalled lately. I would like to compare this patch to what's in -CURRENT, which has evolved a fair bit since the patch was updated. Once that picture is more clear, we can focus on stable/11.

np

I have tried to read through everything, but I will inevitably repeat some observations and miss others. Sorry in advance. Some comments on the patch:

  • The FMR_NEEDFREE target in -CURRENT is now the same as in this patch (though I didn't realize that when I made the change). See r332365.

Are you sure? I still see 'r = FMR_LOTSFREE;' at line 4624.

r332365: Discussed. Not very important. I don't touch this in the patch; maybe zfs_arc_free_target is badly named (this target is used like a zfs_arc_min_free_target, and for my setup I use v_free_target x 1.4).
Anyway, this is a sysctl and can be tuned.

  • A regression affecting 11.2 which caused the ARC to shrink very drastically was fixed in r338142. It seems this patch worked around that regression by reaping the kmem caches after each call to arc_adjust.

The ARC shrank very drastically because of a stale 'needfree' and too-late cache reaping. Yes, after arc_adjust we can have a lot of free memory in the kmem caches (I have seen 20-30 GB) and no free memory available to the system, while still under low-memory pressure.
Not every call to arc_adjust causes reaping of the kmem caches, only arc_adjust under memory pressure (needfree is also used to track this).
Before calling arc_adjust under memory pressure, try reaping the kmem caches as the faster and cheaper solution.

  • needfree has been removed entirely in r325851.

I think it is wrong to remove needfree and then restore it with different logic.
needfree is used to track the amount of the initial memory deficit and to mark the lowmem event (we need to free a lot of memory no matter what other subsystems do).

Given the last point, my impression is that -CURRENT already contains the bulk of this change. The main difference is that this patch also drains the per-CPU caches. However, the per-CPU bucket size is quite small for any of the zio buf zones, so I'm skeptical that that part of the change is very important. Does anyone running ZFS on -CURRENT post r338142 observe any problems that are fixed by this patch?

No. Some logic from the patch is missed:

  • calculate the free memory in the kmem caches and use it to check ARC growth
  • reap the kmem caches before shrinking the ARC
  • reap from the biggest zone first and stop reaping after reaching the target
  • force reaping after the ARC shrink
  • eliminate locking (and a deadlock) in arc_lowmem()

Thank you for a review!

markj added a comment. Aug 23 2018, 3:12 PM
In D7538#358922, @markj wrote:
  • The FMR_NEEDFREE target in -CURRENT is now the same as in this patch (though I didn't realize that when I made the change). See r332365.

Are you sure? I still see 'r = FMR_LOTSFREE;' at line 4624.

r332365: Discussed. Not very important. I don't touch this in the patch; maybe zfs_arc_free_target is badly named (this target is used like a zfs_arc_min_free_target, and for my setup I use v_free_target x 1.4).
Anyway, this is a sysctl and can be tuned.

To be clear, I'm just stating that r332365 changed zfs_arc_free_target to be equal to vm_cnt.v_free_target. It looks to me that this is equivalent to the change you made to arc_available_memory(EXCLUDE_ZONE_CACHE), where v_free_target is referenced directly.

  • A regression affecting 11.2 which caused the ARC to shrink very drastically was fixed in r338142. It seems this patch worked around that regression by reaping the kmem caches after each call to arc_adjust.

The ARC shrank very drastically because of a stale 'needfree' and too-late cache reaping. Yes, after arc_adjust we can have a lot of free memory in the kmem caches (I have seen 20-30 GB) and no free memory available to the system, while still under low-memory pressure.
Not every call to arc_adjust causes reaping of the kmem caches, only arc_adjust under memory pressure (needfree is also used to track this).
Before calling arc_adjust under memory pressure, try reaping the kmem caches as the faster and cheaper solution.

  • needfree has been removed entirely in r325851.

I think it is wrong to remove needfree and then restore it with different logic.
needfree is used to track the amount of the initial memory deficit and to mark the lowmem event (we need to free a lot of memory no matter what other subsystems do).

Ok, I need to study needfree more closely.

Given the last point, my impression is that -CURRENT already contains the bulk of this change. The main difference is that this patch also drains the per-CPU caches. However, the per-CPU bucket size is quite small for any of the zio buf zones, so I'm skeptical that that part of the change is very important. Does anyone running ZFS on -CURRENT post r338142 observe any problems that are fixed by this patch?

No. Some logic from the patch is missed:

  • calculate the free memory in the kmem caches and use it to check ARC growth

BTW, why does uma_zone_get_free_size() ignore items in the bucket cache, uz_buckets? They are not counted in uk_free.

  • reap the kmem caches before shrinking the ARC

We do that now.

  • reap from the biggest zone first and stop reaping after reaching the target

Ok. It seems this is an optimization and we can think about it separately. Do you agree? See also D16667, which is the UMA patch we discussed a long time ago.

  • force reaping after the ARC shrink
  • eliminate locking (and a deadlock) in arc_lowmem()

For now I'd suggest handling the deadlock as a separate change. There are downsides to making arc_lowmem() non-blocking: the page daemon will adjust its target based on the number of pages freed by arc_lowmem(), so it assumes that arc_lowmem() will free pages before returning. If it is non-blocking, the page daemon will be too aggressive in some cases, and this will result in excess swapping.

Instead, we could make arc_lowmem() non-blocking in extreme situations, i.e.,

if (curproc == pageproc && !vm_page_count_min())
    cv_wait();
return;

What do you think?

To be clear, I'm just stating that r332365 changed zfs_arc_free_target to be equal to vm_cnt.v_free_target. It looks to me that this is equivalent to the change you made to arc_available_memory(EXCLUDE_ZONE_CACHE), where v_free_target is referenced directly.

No.

arc_available_memory(EXCLUDE_ZONE_CACHE) checks the conditions for memory pressure: how much free memory the OS sees (the kmem caches are not counted for this).
arc_available_memory(INCLUDE_ZONE_CACHE) checks the conditions for ARC growth and counts the kmem caches as free memory.
It is not directly related to v_free_target.

Given the last point, my impression is that -CURRENT already contains the bulk of this change. The main difference is that this patch also drains the per-CPU caches. However, the per-CPU bucket size is quite small for any of the zio buf zones, so I'm skeptical that that part of the change is very important. Does anyone running ZFS on -CURRENT post r338142 observe any problems that are fixed by this patch?

No. Some logic from the patch is missed:

  • calculate the free memory in the kmem caches and use it to check ARC growth

BTW, why does uma_zone_get_free_size() ignore items in the bucket cache, uz_buckets? They are not counted in uk_free.

It is just too complex for me to implement (and I am not sure I understand uz_buckets).
I would be glad to see your implementation.

  • reap the kmem caches before shrinking the ARC

We do that now.

  • reap from the biggest zone first and stop reaping after reaching the target

Ok. It seems this is an optimization and we can think about it separately. Do you agree? See also D16667, which is the UMA patch we discussed a long time ago.

From my [current] point of view we need a level of reclamation limited by time (time holding the zone lock), and to leave some buckets in the kmem cache (maybe based on cache statistics?).
And call this at a regular interval: this is not directly related to the ARC; under high network load I see lots of memory in the mbuf kmem cache (about 30 GB). This "free" memory is not available to the ARC, and the ARC hit ratio falls.
Is that the "reclaim" from D16667?

  • force reaping after the ARC shrink
  • eliminate locking (and a deadlock) in arc_lowmem()

For now I'd suggest handling the deadlock as a separate change. There are downsides to making arc_lowmem() non-blocking: the page daemon will adjust its target based on the number of pages freed by arc_lowmem(), so it assumes that arc_lowmem() will free pages before returning. If it is non-blocking, the page daemon will be too aggressive in some cases, and this will result in excess swapping.

Instead, we could make arc_lowmem() non-blocking in extreme situations, i.e.,

if (curproc == pageproc && !vm_page_count_min())
    cv_wait();
return;

What do you think?

Under a burst of memory pressure (a 40G/100G network burst, for example) we can call cv_wait() and immediately afterwards exhaust free memory below vm_page_count_min.
This causes a very large performance drop (and slows ARC reclaim).
ARC reclaim responds very slowly, and any lock in this path can cause a deadlock.
In this case an aggressive page daemon is the lesser evil.

markj added a comment. Aug 28 2018, 9:04 PM

To be clear, I'm just stating that r332365 changed zfs_arc_free_target to be equal to vm_cnt.v_free_target. It looks to me that this is equivalent to the change you made to arc_available_memory(EXCLUDE_ZONE_CACHE), where v_free_target is referenced directly.

No.

arc_available_memory(EXCLUDE_ZONE_CACHE) checks the conditions for memory pressure: how much free memory the OS sees (the kmem caches are not counted for this).

Yes, which is exactly what the computation freemem - zfs_arc_free_target is. If you expand these definitions, it is vm_cnt.v_free_count - vm_cnt.v_free_target, where v_free_count does not include the UMA caches. When v_free_count < v_free_target, the system is under memory pressure, and the page daemon attempts to free pages until v_free_count >= v_free_target. In -CURRENT, you can think of needfree as being the same as v_free_target - v_free_count when this difference is positive. In stable branches this is not quite true.

Some logic from the patch is missing:

  • reaping from the biggest zone, stopping after the target is reached

Ok. It seems this is an optimization and we can think about it separately. Do you agree? See also D16667, which is the UMA patch we discussed a long time ago.

From my [current] point of view, we need the level of reclamation to be limited by time (the time the zone lock is held), and to leave some buckets in the kmem cache (perhaps based on cache statistics?).

That's what the "trim" request in D16667 does: we can ask each UMA zone/kmem cache to release a fraction of their buckets. The amount released depends on how busy the zone was in the last couple of minutes.

And call this at a regular interval: this is not directly related to the ARC, but under high network load I see lots of memory in the mbuf kmem cache (about 30GB). This "free" memory is not available to the ARC, and the ARC hit ratio falls.
The "reclaim" request from D16667?

Hmm. This is somewhat complicated to explain because of differences between stable/11 and head/. In particular, the page daemon behaviour has changed, and your patch also changes the interactions between the page daemon and the ARC.

On stable/11, the page daemon's behaviour is pretty simple. It has some thresholds: v_free_target, vm_pageout_wakeup_thresh and v_free_min, with v_free_min < vm_pageout_wakeup_thresh < v_free_target. vm_pageout_wakeup_thresh is slightly larger than v_free_min. When v_free_count becomes smaller than vm_pageout_wakeup_thresh, the page daemon is woken up and frees pages until v_free_count >= v_free_target. It will occasionally call lowmem handlers and uma_reclaim(), but no more than once every ten seconds. One important note is that v_free_target - vm_pageout_wakeup_thresh becomes quite large on systems with lots of memory.

On head/, the behaviour is changed somewhat. The page daemon compares v_free_target and v_free_count every 0.1s. If it wakes up and observes v_free_count < v_free_target, it will reclaim pages from the inactive queue. Like before, it will call lowmem handlers and uma_reclaim() at most once every ten seconds. However, it now responds to memory pressure more aggressively, since it sleeps for only 0.1s instead of waiting for the condition v_free_count < vm_pageout_wakeup_thresh to become true.
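
As a rough model of the two wakeup policies (the real page daemon is event-driven and far more involved; the threshold values in the test are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy thresholds, ordered v_free_min < wakeup_thresh < v_free_target. */
struct thresholds {
	int64_t v_free_min;
	int64_t wakeup_thresh;	/* vm_pageout_wakeup_thresh */
	int64_t v_free_target;
};

/* stable/11: the page daemon sleeps until free pages fall below the
 * (low) wakeup threshold. */
static bool
pagedaemon_scans_stable11(const struct thresholds *t, int64_t v_free_count)
{
	return (v_free_count < t->wakeup_thresh);
}

/* head/: the page daemon polls every 0.1s and scans whenever free
 * pages are below the (much higher) target. */
static bool
pagedaemon_scans_head(const struct thresholds *t, int64_t v_free_count)
{
	return (v_free_count < t->v_free_target);
}
```

The interesting region is between wakeup_thresh and v_free_target: there only the head/ page daemon acts, which is why it ends up calling uma_reclaim() much sooner.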

My suspicion is that with your patch, the condition v_free_count < vm_pageout_wakeup_thresh is never true because the ARC always releases memory first. Thus, uma_reclaim() isn't getting called, so the mbuf zones are full of unused memory. You can check this claim by looking at the "page daemon wakeups" line in vmstat -s output. On -CURRENT the situation is different, and I suspect that you will not see the giant mbuf cluster zones: the page daemon is waking up sooner and will likely end up calling uma_reclaim() periodically. I think D16667 is mostly unrelated to this problem.

Have you tested plain -CURRENT with your workload? I do not claim it will solve all of the problems fixed by this patch, but it will be easier for us to diagnose and fix the remaining issues if you are willing to try again.

  • force reaping after ARC shrink
  • eliminate locking (and deadlock) in arc_lowmem()

For now I'd suggest handling the deadlock as a separate change. There are downsides to making arc_lowmem() non-blocking: the page daemon will adjust its target based on the number of pages freed by arc_lowmem(), so it assumes that arc_lowmem() will free pages before returning. If it is non-blocking, the page daemon will be too aggressive in some cases, and this will result in excess swapping.

Instead, we could make arc_lowmem() non-blocking in extreme situations, i.e.,

if (curproc == pageproc && !vm_page_count_min())
    cv_wait();
return;

What do you think?

...
In this case an aggressive page daemon is the lesser evil.

I don't really agree with the last statement. Removing the cv_wait() entirely changes behaviour for all workloads. I believe it will cause excessive swapping on desktop systems which use ZFS.

In D7538#361135, @markj wrote:

To be clear, I'm just stating that r332365 changed zfs_arc_free_target to be equal to vm_cnt.v_free_target. It looks to me that this is equivalent to the change you made to arc_available_memory(EXCLUDE_ZONE_CACHE), where v_free_target is referenced directly.

No.

arc_available_memory(EXCLUDE_ZONE_CACHE) checks the conditions for memory pressure, i.e., how much free memory the OS sees (kmem caches are not counted in this).

Yes, which is exactly what the computation freemem - zfs_arc_free_target is. If you expand these definitions, it is vm_cnt.v_free_count - vm_cnt.v_free_target, where v_free_count does not include UMA caches. When v_free_count < v_free_target, the system is under memory pressure, and the page daemon attempts to free pages until v_free_count >= v_free_target. In -CURRENT, you can think of needfree as being the same as v_free_target - v_free_count when this difference is positive. In stable branches this is not quite true.

Sorry, I missed your point.
arc_available_memory() has different behaviour (EXCLUDE/INCLUDE ZONE_CACHE); this can't be matched to r332365.
zfs_arc_free_target is changeable via sysctl, and I changed it via /etc/sysctl.conf to a value closer to r332365.
The really optimal value of zfs_arc_free_target is too complex and depends on the other workload (e.g. net bursts) and on the ARC shrink speed (target: don't let vm_cnt.v_free_count drop too low before the ARC shrinks and returns memory to the system).

Some logic from the patch is missing:

  • reaping from the biggest zone, stopping after the target is reached

Ok. It seems this is an optimization and we can think about it separately. Do you agree? See also D16667, which is the UMA patch we discussed a long time ago.

From my [current] point of view, we need the level of reclamation to be limited by time (the time the zone lock is held), and to leave some buckets in the kmem cache (perhaps based on cache statistics?).

That's what the "trim" request in D16667 does: we can ask each UMA zone/kmem cache to release a fraction of their buckets. The amount released depends on how busy the zone was in the last couple of minutes.

My point: do not spend too much time in this path.

And call this at a regular interval: this is not directly related to the ARC, but under high network load I see lots of memory in the mbuf kmem cache (about 30GB). This "free" memory is not available to the ARC, and the ARC hit ratio falls.
The "reclaim" request from D16667?

Hmm. This is somewhat complicated to explain because of differences between stable/11 and head/. In particular, the page daemon behaviour has changed, and your patch also changes the interactions between the page daemon and the ARC.

On stable/11, the page daemon's behaviour is pretty simple. It has some thresholds: v_free_target, vm_pageout_wakeup_thresh and v_free_min, with v_free_min < vm_pageout_wakeup_thresh < v_free_target. vm_pageout_wakeup_thresh is slightly larger than v_free_min. When v_free_count becomes smaller than vm_pageout_wakeup_thresh, the page daemon is woken up and frees pages until v_free_count >= v_free_target. It will occasionally call lowmem handlers and uma_reclaim(), but no more than once every ten seconds. One important note is that v_free_target - vm_pageout_wakeup_thresh becomes quite large on systems with lots of memory.

On head/, the behaviour is changed somewhat. The page daemon compares v_free_target and v_free_count every 0.1s. If it wakes up and observes v_free_count < v_free_target, it will reclaim pages from the inactive queue. Like before, it will call lowmem handlers and uma_reclaim() at most once every ten seconds. However, it now responds to memory pressure more aggressively, since it sleeps for only 0.1s instead of waiting for the condition v_free_count < vm_pageout_wakeup_thresh to become true.

My suspicion is that with your patch, the condition v_free_count < vm_pageout_wakeup_thresh is never true because the ARC always releases memory first.

My patch also removes the wait in arc_lowmem(). The ARC releases memory too slowly (via a complex algorithm), and until ARC reclaim has started and done much of its work, the page daemon can reclaim memory from other targets concurrently.
For this, a good condition is zfs_arc_free_target < v_free_target, using needfree as a latch of the initial memory deficit.

Thus, uma_reclaim() isn't getting called, so the mbuf zones are full of unused memory. You can check this claim by looking at the "page daemon wakeups" line in vmstat -s output. On -CURRENT the situation is different, and I suspect that you will not see the giant mbuf cluster zones: the page daemon is waking up sooner and will likely end up calling uma_reclaim() periodically. I think D16667 is mostly unrelated to this problem.

No, in my workload lots of memory sits in the mbuf cache independent of memory pressure/the page daemon:

  1. net workload rises
  2. mbuf consumption rises
  3. free memory drops
  4. the ARC reacts to the free memory drop by shrinking
  5. net workload starts to drop
  6. mbufs are released to the zone cache
  7. free memory does not rise
  8. the ARC does not grow, no memory pressure arises, the page daemon is not activated.
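
The sequence above comes down to a simple accounting fact: frees that land in a UMA zone cache do not raise v_free_count, so the ARC's view of free memory never recovers. A toy model (names illustrative, not the UMA API):

```c
#include <assert.h>
#include <stdint.h>

/* Toy accounting: pages are either free, in a zone cache, or in use. */
struct toy_mem {
	int64_t v_free_count;	/* what arc_available_memory() sees */
	int64_t zone_cached;	/* invisible to v_free_count */
	int64_t in_use;
};

static void
toy_alloc(struct toy_mem *m, int64_t pages)
{
	/* Allocations drain the zone cache first, then truly free pages. */
	int64_t from_cache = pages < m->zone_cached ? pages : m->zone_cached;

	m->zone_cached -= from_cache;
	m->v_free_count -= pages - from_cache;
	m->in_use += pages;
}

static void
toy_free_to_cache(struct toy_mem *m, int64_t pages)
{
	/* An mbuf free returns items to the zone cache, not to freemem. */
	m->in_use -= pages;
	m->zone_cached += pages;
}
```

After a burst is allocated and then freed, v_free_count is unchanged even though the memory is idle, which is exactly why the ARC stays shrunk in step 8.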

Have you tested plain -CURRENT with your workload? I do not claim it will solve all of the problems fixed by this patch, but it will be easier for us to diagnose and fix the remaining issues if you are willing to try again.

Not in the near future.
Maybe Lev Serebryakov will; as far as I know, you interact with him. The workload is different, but some of the behaviour is similar.

In this case an aggressive page daemon is the lesser evil.

I don't really agree with the last statement. Removing the cv_wait() entirely changes behaviour for all workloads. I believe it will cause excessive swapping on desktop systems which use ZFS.

With cv_wait(), all memory will be reclaimed from the ARC, none from other targets.
Without cv_wait(), swapping may rise slightly in some cases, but immediately afterwards memory will be rebalanced between the ARC and the other consumers.

Thanks for the interaction!

markj added a comment. Aug 29 2018, 7:37 PM
In D7538#361135, @markj wrote:

No, in my workload lots of memory sits in the mbuf cache independent of memory pressure/the page daemon:

  1. net workload rises
  2. mbuf consumption rises
  3. free memory drops
  4. the ARC reacts to the free memory drop by shrinking
  5. net workload starts to drop
  6. mbufs are released to the zone cache
  7. free memory does not rise
  8. the ARC does not grow, no memory pressure arises, the page daemon is not activated.

The ARC is not growing after 8, but the ARC hit rate is too low. Why is it not growing? Is it because the free_memory < (arc_c >> arc_no_grow_shift) condition is true, or is there some other reason?

Have you tested plain -CURRENT with your workload? I do not claim it will solve all of the problems fixed by this patch, but it will be easier for us to diagnose and fix the remaining issues if you are willing to try again.

Not in the near future.
Maybe Lev Serebryakov will; as far as I know, you interact with him. The workload is different, but some of the behaviour is similar.

Thanks.

In this case an aggressive page daemon is the lesser evil.

I don't really agree with the last statement. Removing the cv_wait() entirely changes behaviour for all workloads. I believe it will cause excessive swapping on desktop systems which use ZFS.

With cv_wait(), all memory will be reclaimed from the ARC, none from other targets.

This is not always true. It depends on the size of the ARC relative to the system's memory.

Without cv_wait(), swapping may rise slightly in some cases, but immediately afterwards memory will be rebalanced between the ARC and the other consumers.

I agree that we need to move in this direction (asynchronous reclamation of ARC and kmem caches), but some careful thought needs to be given to how this will interact with the page daemon.

  1. the ARC does not grow, no memory pressure arises, the page daemon is not activated.

The ARC is not growing after 8, but the ARC hit rate is too low. Why is it not growing? Is it because the free_memory < (arc_c >> arc_no_grow_shift) condition is true, or is there some other reason?

Yes, because the free_memory < (arc_c >> arc_no_grow_shift) condition is true.
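
For reference, that no-grow check amounts to the following standalone sketch; the shift value of 5 is the usual default for arc_no_grow_shift, taken as an assumption here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * The ARC refuses to grow while free memory is below a fraction of the
 * current target size arc_c; with a shift of 5 that fraction is 1/32.
 */
static bool
arc_no_grow(int64_t free_memory, uint64_t arc_c, int arc_no_grow_shift)
{
	return (free_memory < (int64_t)(arc_c >> arc_no_grow_shift));
}
```

So once the ARC has shrunk, "free" memory trapped in zone caches keeps free_memory below arc_c/32 and the ARC never grows back, which matches the scenario described above.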

Have you tested plain -CURRENT with your workload? I do not claim it will solve all of the problems fixed by this patch, but it will be easier for us to diagnose and fix the remaining issues if you are willing to try again.

Not in the near future.
Maybe Lev Serebryakov will; as far as I know, you interact with him. The workload is different, but some of the behaviour is similar.

Thanks.

In this case an aggressive page daemon is the lesser evil.

I don't really agree with the last statement. Removing the cv_wait() entirely changes behaviour for all workloads. I believe it will cause excessive swapping on desktop systems which use ZFS.

With cv_wait(), all memory will be reclaimed from the ARC, none from other targets.

This is not always true. It depends on the size of the ARC relative to the system's memory.

With the locking in arc_lowmem(), ARC reclaim stops only after the freemem target is reached.
Once the lock in arc_lowmem() is released, there is no more work left to free memory for other zones/subsystems, right?

Without cv_wait(), swapping may rise slightly in some cases, but immediately afterwards memory will be rebalanced between the ARC and the other consumers.

I agree that we need to move in this direction (asynchronous reclamation of ARC and kmem caches), but some careful thought needs to be given to how this will interact with the page daemon.

This is a complex question, I agree.