
Improved i386 support in arc_memory_throttle
ClosedPublic

Authored by smh on Aug 31 2014, 2:37 PM.
Details

Reviewers
peter
avg
Summary

i386 has KVA limits that raw page counts don't consider; improve the logic in arc_memory_throttle to take this into account.

Also rename kmem_used -> kmem_map_used for consistency with the VM name and to avoid confusion.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

smh retitled this revision from to Improved i386 support in arc_memory_throttle.
smh updated this object.
smh edited the test plan for this revision. (Show Details)
smh added a reviewer: peter.

Convert pages to bytes for i386 available_memory test

Make the curproc page_load check closer to Solaris now that we have kmem_free_min()

Standardise on the ZFS ptob instead of the native ptob and remove unneeded casts.
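
As a rough sketch of the conversion being described (illustrative only, not the exact patch hunk; freemem here is the illumos-style free page count):

/*
 * available_memory must be in bytes for the throttle comparison, while
 * freemem is a count of pages, so convert with the ZFS ptob() macro.
 */
uint64_t available_memory = ptob((uint64_t)freemem);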

Hmm. I find this whole section of code troubling.

First we have the physical page pool wrappers:
kmem_free_min() => wrapper around physical page free low limit
kmem_free_count() => wrapper around physical pages free
kmem_page_count() => wrapper around physical number of pages
where kmem implies KVA or kernel memory, but it's not; it is the physical page pool rather than kmem.

On the other hand there's the actual kva/kernel memory wrappers:
kmem_map_used() => wrapper around kva malloc arena reservation usage.
kmem_size() => MIN(physical page count, sizeof(kva malloc arena reservation))

The last one is special as it's both kmem and non-kmem at the same time.

At a glance, I can't tell which is which any more. The separation of kmem_map_* vs kmem_page_* sort of helps, but it is still rather inconsistent, e.g. kmem_free_count(), which is non-kmem physical pages but is not called kmem_page_*.
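
To make the naming tangle concrete, here is a minimal sketch of the wrappers as described above (assumed shapes for illustration, not the committed code):

uint64_t
kmem_free_count(void)
{
        /* Physical pages free, despite the kmem_ prefix. */
        return (vm_cnt.v_free_count + vm_cnt.v_cache_count);
}

uint64_t
kmem_map_used(void)
{
        /* Genuinely kmem: bytes allocated from the KVA malloc arena. */
        return (vmem_size(kmem_arena, VMEM_ALLOC));
}

uint64_t
kmem_size(void)
{
        /* Both at once: MIN(physical memory, malloc arena size). */
        return (MIN(ptoa((uintmax_t)vm_cnt.v_page_count),
            vmem_size(kmem_arena, VMEM_ALLOC | VMEM_FREE)));
}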

Are you sure you want to get into this? :)

The #ifdef i386 block (and addition of the kmem_map_free() function) looks about right though.

The two changes that implement the title of the review seem fine, but the unrelated parts give me pause; more work is needed there. We probably shouldn't mix two issues into one.

Thanks Peter. If it's wrong, let's get it fixed so it's clear for the future.

I'll keep it all in this one review for now and split it out if you think that's needed on commit.

smh edited edge metadata.

Rename kmem_free_* -> kmem_page_free_* so their meaning is clearer.

OK, so that's the functions you mentioned renamed so their meaning is clearer, with the exception of kmem_size().

It's used in the following locations:

void
arc_init(void)
{
...
        /* Start out with 1/8 of all memory */
        arc_c = kmem_size() / 8;
...
        /*
         * Allow the tunables to override our calculations if they are
         * reasonable (ie. over 16MB)
         */
        if (zfs_arc_max > 64<<18 && zfs_arc_max < kmem_size())
                arc_c_max = zfs_arc_max;
...
...
        if (kmem_size() < 512 * (1 << 20)) {
                printf("ZFS WARNING: Recommended minimum kmem_size is 512MB; "
                    "expect unstable behavior.\n");
                printf("             Consider tuning vm.kmem_size and "
                    "vm.kmem_size_max\n");
                printf("             in /boot/loader.conf.\n");
        }


static int
arc_reclaim_needed(void)
{
...
#ifdef __i386__
        /* i386 has KVA limits that the raw page counts above don't consider */
        if (kmem_map_used() > (kmem_size() * 3) / 4) {
                DTRACE_PROBE2(arc__reclaim_used, uint64_t,
                    kmem_used(), uint64_t, (kmem_size() * 3) / 4);
                return (1);
        }
#endif

So, what do you think? The questions that spring to mind are:

  • Does it need renaming?
  • Does it need changing in anyway?

Note, this may be more of a review for the code that was already committed rather than of the proposed changes.

First, as I've already said in an email, I suggest removing kmem_* or at the very least placing them into a different location.
They have no business being under /compat/.
Actually something like

#define freemem (vm_cnt.v_free_count + vm_cnt.v_cache_count)

could be useful in compat. This way some more of the differences from the upstream code could be eliminated.
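
For example (a hypothetical illustration of the diff reduction, assuming lotsfree and needfree are similarly provided by compat defines), an upstream illumos test such as

if (freemem < lotsfree + needfree)
        return (1);

could then compile unchanged on FreeBSD, with freemem supplied by the compat header.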

If we un-inline a couple of functions in arc_reclaim_needed() then we will see the following code:

if (vm_cnt.v_free_count + vm_cnt.v_cache_count < zfs_arc_free_target) {
        DTRACE_PROBE2(arc__reclaim_freetarget, uint64_t,
            kmem_free_count(), uint64_t, zfs_arc_free_target);
        return (1);
}

/*
 * Cooperate with pagedaemon when it's time for it to scan
 * and reclaim some pages.
 */
if (vm_cnt.v_free_count + vm_cnt.v_cache_count < vm_pageout_wakeup_thresh) {
        DTRACE_PROBE(arc__reclaim_paging);
        return (1);
}

The two checks above can be combined as follows:

if (vm_cnt.v_free_count + vm_cnt.v_cache_count < zfs_arc_free_target ||
    vm_cnt.v_free_count + vm_cnt.v_cache_count < vm_pageout_wakeup_thresh)
        return 1;

which is equivalent to:

if (vm_cnt.v_free_count + vm_cnt.v_cache_count < MAX(zfs_arc_free_target, vm_pageout_wakeup_thresh))

using the default value of zfs_arc_free_target:

if (vm_cnt.v_free_count + vm_cnt.v_cache_count < MAX(vm_cnt.v_free_target, vm_pageout_wakeup_thresh))

Given that v_free_target is much larger than vm_pageout_wakeup_thresh on an auto-tuned system, the above reduction shows that:

  • introduction of the zfs_arc_free_target check completely obviated the vm_paging_needed() check, but in a rather obscure way
  • ARC would now start to give up memory much sooner than it used to; I do not think that this is a universally good and acceptable change

Finally, I do not think that making zfs_arc_free_target configurable via sysctl actually buys us anything.
VM and pageout behavior is quite complex, with many knobs. Requiring a ZFS user to have any understanding of that is not a good idea, IMO.
And without that understanding, users would just blindly play with unknown controls. I foresee the appearance of misguided how-tos, lots of requests for help, and frustration from misconfigured systems.

introduction of the zfs_arc_free_target check completely obviated the vm_paging_needed() check, but in a rather obscure way

Actually this was already the case with the original code:

if (kmem_map_used() > (kmem_size() * 3) / 4) {

ARC would now start to give up memory much sooner than it used to; I do not think that this is a universally good and acceptable change

The opposite is in fact true in my tests: before, you'd start giving up memory at 3/4 of kmem_size, which can easily be seen forcing reclaim at 40GB free on a 192GB machine.
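
To put rough numbers on that (assuming kmem_size() auto-tunes to something close to physical RAM, which is an assumption rather than something stated here):

(kmem_size() * 3) / 4 ~= (192GB * 3) / 4 = 144GB

so the old test only triggered reclaim once roughly 144GB was in use, i.e. with around 40-48GB left sitting free.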

Finally, I do not think that making zfs_arc_free_target configurable via sysctl actually buys us anything.

The example that springs to mind is increasing vfs.zfs.arc_free_target while keeping vm.v_free_target the same. This can be easier to manage for some use cases across varying hardware than having to set vfs.zfs.arc_max, something we've already made use of here.

Fix DTRACE probe call to renamed function

In D702#19, @smh wrote:

introduction of the zfs_arc_free_target check completely obviated the vm_paging_needed() check, but in a rather obscure way

Actually this was already the case with the original code:

if (kmem_map_used() > (kmem_size() * 3) / 4) {

Actually, no. Those are two different tests because one checks physical memory and the other checks KVA.
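
A minimal side-by-side sketch of the distinction, using the code already quoted in this review:

/* Physical memory test: free page count against a page target. */
if (vm_cnt.v_free_count + vm_cnt.v_cache_count < zfs_arc_free_target)
        return (1);

/* KVA test: bytes used in the kmem malloc arena against its size. */
if (kmem_map_used() > (kmem_size() * 3) / 4)
        return (1);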

ARC would now start to give up memory much sooner than it used to; I do not think that this is a universally good and acceptable change

The opposite is in fact true in my tests: before, you'd start giving up memory at 3/4 of kmem_size, which can easily be seen forcing reclaim at 40GB free on a 192GB machine.

But it was not true on my 8GB machine.
And your example suggests that your memory pressure came from kernel memory usage (by ARC itself and perhaps other usages).
Whereas in my environments memory pressure typically comes from userland. So KVA usage is not greatly affected, but physical memory is competed for.

In either case, dropping the KVA check on amd64 is the correct part of your changes. I do not have anything against it.
I do object against the zfs_arc_free_target change.

Finally, I do not think that making zfs_arc_free_target configurable via sysctl actually buys us anything.

The example that springs to mind is increasing vfs.zfs.arc_free_target while keeping vm.v_free_target the same. This can be easier to manage for some use cases across varying hardware than having to set vfs.zfs.arc_max, something we've already made use of here.

No, one cannot be a substitute for the other.

In D702#22, @avg wrote:
In D702#19, @smh wrote:

introduction of the zfs_arc_free_target check completely obviated the vm_paging_needed() check, but in a rather obscure way

Actually this was already the case with the original code:

if (kmem_map_used() > (kmem_size() * 3) / 4) {

Actually, no. Those are two different tests because one checks physical memory and the other checks KVA.

These do, however, have quite a tight correlation unless you're manually tuning.

In D702#22, @avg wrote:
In D702#19, @smh wrote:

ARC would now start to give up memory much sooner than it used to; I do not think that this is a universally good and acceptable change

The opposite is in fact true in my tests: before, you'd start giving up memory at 3/4 of kmem_size, which can easily be seen forcing reclaim at 40GB free on a 192GB machine.

But it was not true on my 8GB machine.
And your example suggests that your memory pressure came from kernel memory usage (by ARC itself and perhaps other usages).

There was no memory pressure apart from ARC itself, and this resulted in 40GB sat free, which is to be expected given the previous code.

Obviously this is fine if the machine is using a decent % of RAM for other things, be that other kernel allocations or userland, but if not, the more RAM the higher the wastage.

In D702#22, @avg wrote:
In D702#19, @smh wrote:

Finally, I do not think that making zfs_arc_free_target configurable via sysctl actually buys us anything.

The example that springs to mind is increasing vfs.zfs.arc_free_target while keeping vm.v_free_target the same. This can be easier to manage for some use cases across varying hardware than having to set vfs.zfs.arc_max, something we've already made use of here.

No, one cannot be a substitute for the other.

Indeed it isn't a substitute, but it is an alternative which is easier to manage in some cases, and it also provides an additional way to tune ARC usage. In a system which is ARC heavy, for example, you may want to allow it to give up space more readily for other usages, e.g. spiky apps, if that's your workload.

I've knocked up the following spreadsheet to estimate the values in each case.
https://docs.google.com/spreadsheets/d/1mSL2aH72Vca-phBm7DpACZ-UDeXV15jx4goU1N8Xv-Q/edit?usp=sharing

It's clear from this that for high kmem consumers, be that ARC or other kernel use, the old 1/4 mem is always going to be the reclaim trigger.

I think we all agree that said limit is of no real use on platforms that don't define UMA_MD_SMALL_ALLOC, so the question is: do we keep the current limit of v_free_target or switch to vm_pageout_wakeup_thresh?

I've added the Solaris / illumos triggers as a useful comparison point, which shows that they trigger just above the values that the new v_free_target trigger provides, which is why I believe it's a good value to use, especially given the feedback on the initial bug report.

My concern with removing the new check and relying on vm_pageout_wakeup_thresh as the trigger is that, at that point, the VM is going to start cleanup of apps etc., potentially before old ARC entries which may be of less value.

I'm sure this could be argued both ways, which is also potentially another reason to keep the sysctl, so we can experiment and tune easily. Either way, I'd like to hear others' views on this.

As a final point, the values in that spreadsheet seem to highlight that for large memory machines, the baseline VM values could potentially do with some upper limits so we don't have over 7GB of RAM sitting there doing nothing when paging triggers, but that's a discussion for another day.

In D702#23, @smh wrote:
In D702#22, @avg wrote:
In D702#19, @smh wrote:

introduction of the zfs_arc_free_target check completely obviated the vm_paging_needed() check, but in a rather obscure way

Actually this was already the case with the original code:

if (kmem_map_used() > (kmem_size() * 3) / 4) {

Actually, no. Those are two different tests because one checks physical memory and the other checks KVA.

These do, however, have quite a tight correlation unless you're manually tuning.

Well, this is simply not true in general. In some cases there may be correlation, in others there is none. When memory pressure comes from userland memory allocations (anonymous memory), KVA does not have much play.
To be clear, I am speaking not about things that do not allow ARC size to reach its maximum. I am speaking about things that force ARC size to its minimum.
That is, my use case is such that userland applications consume a large proportion of memory. In this case ARC does not have a chance to use even half of the available RAM. But I do not want ARC to constantly be at its minimum.

In D702#22, @avg wrote:
In D702#19, @smh wrote:

ARC would now start to give up memory much sooner than it used to; I do not think that this is a universally good and acceptable change

The opposite is in fact true in my tests: before, you'd start giving up memory at 3/4 of kmem_size, which can easily be seen forcing reclaim at 40GB free on a 192GB machine.

But it was not true on my 8GB machine.
And your example suggests that your memory pressure came from kernel memory usage (by ARC itself and perhaps other usages).

There was no memory pressure apart from ARC itself, and this resulted in 40GB sat free, which is to be expected given the previous code.

Obviously this is fine if the machine is using a decent % of RAM for other things, be that other kernel allocations or userland, but if not, the more RAM the higher the wastage.

I completely agree and that's why I completely support that part (KVA check) of your changes.

In D702#22, @avg wrote:
In D702#19, @smh wrote:

Finally, I do not think that making zfs_arc_free_target configurable via sysctl actually buys us anything.

The example that springs to mind is increasing vfs.zfs.arc_free_target while keeping vm.v_free_target the same. This can be easier to manage for some use cases across varying hardware than having to set vfs.zfs.arc_max, something we've already made use of here.

No, one cannot be a substitute for the other.

Indeed it isn't a substitute, but it is an alternative which is easier to manage in some cases, and it also provides an additional way to tune ARC usage. In a system which is ARC heavy, for example, you may want to allow it to give up space more readily for other usages, e.g. spiky apps, if that's your workload.

Well, my argument is that the new parameter affects the dynamic interaction between the pagedaemon and ARC. The behavior of both is quite complex, and the interaction is not trivial either. That parameter would not set some static limit but would affect the dynamic behavior of the system.
Please remember that FreeBSD VM is built around the concept that any free memory is a waste, and so it wants to utilize as much of it as it can. The page cache always grows until a certain threshold is reached, and only then is the pagedaemon awakened to clean up some inactive pages.

I've knocked up the following spreadsheet to estimate the values in each case.
https://docs.google.com/spreadsheets/d/1mSL2aH72Vca-phBm7DpACZ-UDeXV15jx4goU1N8Xv-Q/edit?usp=sharing

It's clear from this that for high kmem consumers, be that ARC or other kernel use, the old 1/4 mem is always going to be the reclaim trigger.

I think we all agree that said limit is of no real use on platforms that don't define UMA_MD_SMALL_ALLOC

YES!

so the question is: do we keep the current limit of v_free_target or switch to vm_pageout_wakeup_thresh?

I've added the Solaris / illumos triggers as a useful comparison point, which shows that they trigger just above the values that the new v_free_target trigger provides, which is why I believe it's a good value to use, especially given the feedback on the initial bug report.

One sentence, three distinct statements :-)

  • in illumos they indeed start reducing ARC a little bit earlier, but their page-out algorithm is different too; I once asked the following question, but unfortunately I never got an answer: http://article.gmane.org/gmane.os.solaris.opensolaris.zfs/41241 I believe that is a valid concern both for illumos and FreeBSD.
  • I believe that v_free_target is too large :-) I believe that on systems without an abundance of RAM that value would lead to ARC eventually going to its minimum and oscillating there
  • the bug report and the feedback came in when the pagedaemon had a severe bug, so punishing the ARC served as a workaround for the pagedaemon's bad behavior

My concern with removing the new check and relying on vm_pageout_wakeup_thresh as the trigger is that, at that point, the VM is going to start cleanup of apps etc., potentially before old ARC entries which may be of less value.

Right. On the other hand, there could be inactive app memory that is actually a waste of resources and would better be sent to swap while the ARC holds all the hot data that is accessed all the time.

I'm sure this could be argued both ways, which is also potentially another reason to keep the sysctl, so we can experiment and tune easily. Either way, I'd like to hear others' views on this.

I believe that it is impossible to hand-tune such a parameter.
In either case, I believe that a lot more testing and feedback is required before we settle this issue.

As a final point, the values in that spreadsheet seem to highlight that for large memory machines, the baseline VM values could potentially do with some upper limits so we don't have over 7GB of RAM sitting there doing nothing when paging triggers, but that's a discussion for another day.

Agree on both accounts. Thank you!

P.S. If you get an impression that I am a bit passionate over this threshold, then it is true :-)
Please see this commit from 2010: https://github.com/freebsd/freebsd/commit/c94255f128c822492f902f30a35a1a2fa9df68d7

Integrate changes based on feedback from all parties.

  • Change the default value of zfs_arc_free_target from v_free_target to vm_pageout_wakeup_thresh (avg).
  • Include ARC reclaim updates from openzfs / illumos for diff minimisation which are nop changes (#ifdef'ed out) (smh).
  • Remove kmem_(page|free)_* methods replacing with #defines for: freemem, minfree, heap_arena (avg).
  • Trigger limited-memory system code paths for archs which don't define UMA_MD_SMALL_ALLOC, not just i386 (alc).
  • Reduce zfs_arc_free_target fallback value (which should never be used) to 256MB from 2G (peter).
  • Fix upstream available_memory calculation using page value instead of bytes (smh).

The total result of these changes in combination with the already committed updates is:

  1. Removal of the 3/4 memory limit for non-small-alloc platforms.
  2. Provide a sysctl to allow tuning of the ARC reclaim trigger, which defaults to the vm_paging_needed() point.
  3. Incorporate missing upstream changes for diff minimisation.
  4. Add dtrace probes for ARC reclaim logic monitoring.

I have not included the removal of kmem_size() in favour of using physmem & vmem_size(...) at this time as that would require more testing.
Reference: https://github.com/avg-I/freebsd/compare/wip/hc/kmem_size-memguard-fix

Correct bracing for i386 ARC reclaim

Extend #ifdef _KERNEL around sysctl to protect against kernel-only values

sys/cddl/compat/opensolaris/kern/opensolaris_kmem.c
166

is this still used for something?

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
205

I think that it should be fine to leave this default-initialized to zero. It would look nicer than some arbitrary limit.
I think that realistically there can never be a need for ARC reclaiming before pagedaemon initialization.

2541

UMA_MD_SMALL_ALLOC check makes __i386 check redundant

thank you! this looks good to me.

smh edited edge metadata.

Added missing machine/vmparam.h include, which was causing all platforms to take the small-memory code paths

Change the default of zfs_arc_free_target to (vm_pageout_wakeup_thresh / 2) * 3.

The previous default of vm_pageout_wakeup_thresh meant that the main trigger for ARC reclaim was the pagedaemon vm_lowmem event. This trigger results in aggressive ARC reclaim, which can easily cause ARC to get pegged down to its minimum value. The new default also closely mirrors that of Solaris, which adds "extra pages here to make sure the scanner doesn't start up while we're freeing memory".
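
A minimal sketch of how that default might be applied in arc_init() (assumed shape; the actual hunk may differ):

/*
 * If the sysctl was not tuned, default the reclaim target to 150% of
 * the pagedaemon wakeup threshold so normal ARC reclaim gets a chance
 * to run before the aggressive vm_lowmem path triggers.
 */
if (zfs_arc_free_target == 0)
        zfs_arc_free_target = (vm_pageout_wakeup_thresh / 2) * 3;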

Also add more dtrace probes so we can better investigate ARC's reclaim and sizing decisions.

Remove spurious comment

FWIW, this line:

zfs_arc_free_target = (vm_pageout_wakeup_thresh / 2) * 3;

makes this approach still, IMHO, fatally flawed. The pagedaemon does other work besides paging out pages; for example, it processes the inactive queue.

Changing the window from ~300% of the wakeup threshold to 150% merely shrinks the window in which ARC can get into a pathological state, where it can free itself fast enough that the pagedaemon never seriously processes the inactive queue. Narrowing the window but not closing it means the hole is still there.

For context, this is the pathological failure case we hit in the pkg / ftp mirrors:

CPU: 0.0% user, 0.0% nice, 0.4% system, 0.1% interrupt, 99.4% idle
Mem: 199M Active, 28G Inact, 2027M Wired, 456M Cache, 49M Free
ARC: 906M Total, 171M MFU, 149M MRU, 1632K Anon, 179M Header, 405M Other
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
369 nobody 1 20 0 25024K 5344K db->db 15 0:13 0.72% rsync
368 nobody 1 20 0 25024K 5436K zio->i 4 0:15 0.72% rsync
10713 root 1 20 0 20928K 3600K zio->i 7 0:03 0.41% rsync
10490 14 1 20 0 16636K 1788K zio->i 7 0:01 0.36% vsftpd
10567 14 1 20 0 16636K 1788K sbwait 7 0:01 0.29% vsftpd
12156 peter 1 20 0 21880K 3060K CPU4 4 0:00 0.18% top
70056 root 1 20 0 20528K 10612K zio->i 5 0:48 0.09% find
12897 root 1 52 0 17068K 2080K wait 10 0:00 0.09% sh
11089 5211 1 20 0 33216K 22468K zio->i 4 0:01 0.08% rsync
1361 www 1 20 0 32700K 4892K zio->i 0 1:55 0.06% nginx
1360 www 1 20 0 36796K 7956K zio->i 13 2:09 0.06% nginx
4066 14 1 20 0 16636K 1788K sbwait 3 0:01 0.05% vsftpd
1362 www 1 20 0 32700K 4648K zio->i 2 2:03 0.05% nginx
1596 www 1 20 0 32700K 4048K kqread 5 3:02 0.05% nginx
11951 5211 1 20 0 16832K 6044K zio->i 2 0:00 0.04% rsync
1358 www 1 20 0 32700K 5128K zio->i 0 2:21 0.04% nginx
40265 5211 1 20 0 37312K 23200K zio->i 3 0:08 0.04% rsync
4794 5211 1 20 0 20928K 7412K zio->i 7 0:02 0.04% rsync
58936 5211 1 20 0 16832K 7360K zio->i 1 0:04 0.04% rsync
77146 5211 1 20 0 41408K 24152K zio->i 3 0:06 0.04% rsync
44375 5211 1 20 0 29120K 15512K zio->i 1 0:06 0.04% rsync
1394 root 1 20 0 18660K 1600K select 8 0:36 0.03% inetd
89006 5211 1 20 0 37312K 22440K zio->i 3 0:05 0.03% rsync

ARC was able to free itself up fast enough to prevent pagedaemon from waking up and dealing with the growth of the inactive queue.

This solved it:

sysctl vfs.zfs.arc_free_target=`sysctl -n vm.pageout_wakeup_thresh`

vfs.zfs.arc_free_target: 173913 -> 56628

After a few minutes ARC was up to 5GB and performance returned.

In D702#42, @peter wrote:

FWIW, this line:

zfs_arc_free_target = (vm_pageout_wakeup_thresh / 2) * 3;

makes this approach still, IMHO, fatally flawed. The pagedaemon does other work besides paging out pages; for example, it processes the inactive queue.

Changing the window from ~300% of the wakeup threshold to 150% merely shrinks the window in which ARC can get into a pathological state, where it can free itself fast enough that the pagedaemon never seriously processes the inactive queue. Narrowing the window but not closing it means the hole is still there.

Keeping it purely at vm_pageout_wakeup_thresh easily causes ARC to get pegged back to min with any real memory pressure; raising it slightly like this prevents that undesired behaviour, as it gives ARC the chance to do normal reclaim vs always triggering aggressive reclaim.

sys/cddl/compat/opensolaris/kern/opensolaris_kmem.c
166

Yes, it's still used in arc_init; I think we should look to eliminate this in a separate update.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
2541

Yeah, I left the original __i386 Solaris check so it's obvious.

In D702#43, @peter wrote:

For context, this is the pathological failure case we hit in the pkg / ftp mirrors:

CPU: 0.0% user, 0.0% nice, 0.4% system, 0.1% interrupt, 99.4% idle
Mem: 199M Active, 28G Inact, 2027M Wired, 456M Cache, 49M Free
ARC: 906M Total, 171M MFU, 149M MRU, 1632K Anon, 179M Header, 405M Other
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
369 nobody 1 20 0 25024K 5344K db->db 15 0:13 0.72% rsync

..

ARC was able to free itself up fast enough to prevent pagedaemon from waking up and dealing with the growth of the inactive queue.

This solved it:

sysctl vfs.zfs.arc_free_target=`sysctl -n vm.pageout_wakeup_thresh`

vfs.zfs.arc_free_target: 173913 -> 56628

After a few minutes ARC was up to 5GB and performance returned.

I'm afraid any tests done with the current version in head should be totally discounted, as there are a number of issues present.

Please retest with this entire patch applied and let us know what you see.

In D702#45, @smh wrote:

I'm afraid any tests done with the current version in head should be totally discounted, as there are a number of issues present.

Please retest with this entire patch applied and let us know what you see.

My personal opinion is that you should already commit something to undo the damage. I would prefer that you first commit a version where the default threshold is equal to the pagedaemon's threshold, as it was before. Any further research and improvement can be committed at a later time, after the necessary discussion and testing (by different parties, not only you and Karl).

Peter, would you be happy with this with the default being vm_pageout_wakeup_thresh?

I know this will definitely cause pegging of ARC at arc_min for certain loads, but if that's what people are most happy with currently and would be happy for this to hit 10.1, at least we'll have fixed one of the key issues.

Back out the zfs_arc_free_target = (vm_pageout_wakeup_thresh / 2) * 3 change based on feedback.

So guys what do you think, are you happy with this latest revision?

A couple of quick thoughts:

  • what's with all the new #ifdef __sun code? Where did that come from?
  • What is the __used thing about on code that's clearly used? (e.g. vm_pageout_scan(), which is used unconditionally)

The churn is so bad I can't keep track of what's what anymore. I'm probably going to have to get a raw diff to compare it to r270758 or so because I can't make sense of this now.

At the end of the patch, are we effectively back to this?

Index: cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
===================================================================
--- cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (revision 270758)
+++ cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (working copy)
@@ -2505,8 +2505,10 @@
return (1);
#endif
#else /* !sun */
+#if defined(__i386) || !defined(UMA_MD_SMALL_ALLOC)
if (kmem_used() > (kmem_size() * 3) / 4)
return (1);
+#endif
#endif /* sun */

#else


I think so, but it's really hard to tell with all the churn and renaming/moving. That 2-line change is basically what I was after.

In D702#51, @peter wrote:

A couple of quick thoughts:

  • what's with all the new #ifdef __sun code? Where did that come from?

Those are changes which were missing from / moved in the upstream illumos version. They've been missed over time; I put them in to eliminate the differences moving forward.

  • What is the __used thing about on code that's clearly used? (e.g. vm_pageout_scan(), which is used unconditionally)

That's to allow dtrace to hook that function; without it we lose the fbt probes due to compiler optimisations.

The churn is so bad I can't keep track of what's what anymore. I'm probably going to have to get a raw diff to compare it to r270758 or so because I can't make sense of this now.

Here's a summary of the key differences:

  1. Default zfs_arc_free_target to vm_pageout_wakeup_thresh, which matches the original vm_paging_needed() test.
  2. Update i386 code path in arc_reclaim_needed to match illumos (remove unneeded btop conversions)
  3. Use illumos i386 code path in arc_reclaim_needed and arc_memory_throttle for archs which define UMA_MD_SMALL_ALLOC.
  4. Fix bytes vs pages conversion in arc_memory_throttle i386 code path.
  5. Match the pageout test in arc_memory_throttle with illumos now that we have minfree available: MAX(ptob(minfree), available_memory) vs available_memory
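
Pulled together, the resulting reclaim tests look roughly like this (a sketch assembled from the hunks quoted in this review, not a verbatim diff):

/* Physical memory: reclaim when free pages drop below the target. */
if (freemem < zfs_arc_free_target) {
        DTRACE_PROBE2(arc__reclaim_freetarget, uint64_t,
            freemem, uint64_t, zfs_arc_free_target);
        return (1);
}

/* KVA: only meaningful where the kmem arena is constrained. */
#if defined(__i386) || !defined(UMA_MD_SMALL_ALLOC)
if (kmem_used() > (kmem_size() * 3) / 4)
        return (1);
#endif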

Misc fixes which have no real functional effect:

  • New dtrace probes
  • Difference elimination with illumos
  • Use defines for freemem, minfree and heap_arena as recommended by avg, which eliminates the new kmem_free_XXX methods and further reduces the difference with illumos
  • Set the fallback value for zfs_arc_free_target to a lower value

Hope this helps.

__used is certainly the wrong tool for the job; noinline should be used instead.
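
For example (a sketch of the suggested attribute; FreeBSD's sys/cdefs.h provides __noinline, and the signature below is assumed):

/*
 * Keep the function out of line so fbt can attach its probes;
 * __used only affects symbol emission, not inlining decisions.
 */
static void __noinline
vm_pageout_scan(struct vm_domain *vmd, int pass)
{
        ...
}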

Could you please address my comment about the initial value of zfs_arc_free_target?

In D702#53, @avg wrote:

__used is certainly the wrong tool for the job; noinline should be used instead.

It's not inlining it; it's excluding it from the symbol table due to being static. I'll retest to see if it's needed now that it has a manual SDT_PROBE.

In D702#55, @smh wrote:
In D702#53, @avg wrote:

__used is certainly the wrong tool for the job; noinline should be used instead.

It's not inlining it; it's excluding it from the symbol table due to being static. I'll retest to see if it's needed now that it has a manual SDT_PROBE.

Are you sure? If what you say were true, then there would be no fbt probes for any static functions, but that is clearly not the case.

In D702#54, @avg wrote:

Could you please address my comment about the initial value of zfs_arc_free_target?

Could you clarify which comment? There are quite a few in there, and I thought I'd covered them all ;-)

That zfs_arc_free_target does not need to be initialized with a weird magic value; default zero initialization would be just fine.
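
In other words (a minimal sketch of the suggested initialization, not the committed code):

/*
 * Zero means "not yet tuned"; arc_init() fills in the real default
 * once the pagedaemon thresholds are known.
 */
static u_int zfs_arc_free_target = 0;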

In D702#56, @avg wrote:
In D702#55, @smh wrote:
In D702#53, @avg wrote:

__used is certainly the wrong tool for the job; noinline should be used instead.

It's not inlining it; it's excluding it from the symbol table due to being static. I'll retest to see if it's needed now that it has a manual SDT_PROBE.

Are you sure? If what you say were true, then there would be no fbt probes for any static functions, but that is clearly not the case.

Just tested and you're totally correct. My misunderstanding stemmed from the fact that removing static fixed it, along with __used being recommended as the fix when I initially researched it, which worked.

Clearly, removing static just confused the optimiser, preventing it from inlining the function. With the addition of the manual probe, which is the real interest, I'm going to remove that in the next version.

Thanks for highlighting this; I gained a little insight into how compilers operate in this regard today :)

Remove __used as the SDT probes are sufficient; it should have been noinline anyway.

Initialise zfs_arc_free_target to zero.

Switched remaining __used for noinline

Sorry to keep pestering you guys, but what are your thoughts now?

There is still more work to be done in this area, but I believe this in its current form lays down a good foundation which we can build upon moving forward, so I'd like to get this in if there are no outstanding objections.

sys/vm/vm_pageout.c
926

This fix, which prevents lowmem event stalls on overflow, will be committed separately; it is here for testing purposes only.

See: https://reviews.freebsd.org/D818

Updated based on r272222, no local changes.

Thanks Andriy; in that case could you accept the revision please :)

avg edited edge metadata.
This revision is now accepted and ready to land. Sep 27 2014, 9:13 PM

Peter, unless you have any objections, I'm going to get this into the tree this week.