i386 has KVA limits that the raw page counts above don't consider; improve the logic in arc_memory_throttle to take this into account.
Also rename kmem_used -> kmem_map_used for consistency with the VM name and to avoid confusion.
Differential D702
Improved i386 support in arc_memory_throttle
Authored by smh on Aug 31 2014, 2:37 PM.
Event Timeline

Comment Actions
Hmm. I find this whole section of code troubling. First we have the physical page pool wrappers. On the other hand there's the actual kva/kernel memory wrappers. The last one is special as it's both kmem and non-kmem at the same time. At a glance, I can't tell which is which any more. The separation of kmem_map_* vs kmem_page_* sort of helps, but it is still rather inconsistent, e.g. kmem_free_count(), which is non-kmem physical pages but is not called kmem_page_*. Are you sure you want to get into this? :)

The #ifdef i386 block (and the addition of the kmem_map_free() function) looks about right though. The two changes that implement the title of the review seem fine, but the unrelated parts give me pause; more work is needed there. We probably shouldn't mix two issues into one.

Comment Actions
Thanks Peter, if it's wrong let's get it fixed so it's clear for the future. I'll keep it all in this one review for now and split it out if you think that's needed on commit.

Comment Actions
OK, so that's the functions you mentioned renamed so their meaning is clearer, with the exception of kmem_size(). It's used in the following locations:

    void
    arc_init(void)
    {
        ...
        /* Start out with 1/8 of all memory */
        arc_c = kmem_size() / 8;
        ...
        /*
         * Allow the tunables to override our calculations if they are
         * reasonable (ie. over 16MB)
         */
        if (zfs_arc_max > 64<<18 && zfs_arc_max < kmem_size())
            arc_c_max = zfs_arc_max;
        ...
        if (kmem_size() < 512 * (1 << 20)) {
            printf("ZFS WARNING: Recommended minimum kmem_size is 512MB; "
                "expect unstable behavior.\n");
            printf("             Consider tuning vm.kmem_size and "
                "vm.kmem_size_max\n");
            printf("             in /boot/loader.conf.\n");
        }
    }

    static int
    arc_reclaim_needed(void)
    {
        ...
    #ifdef __i386__
        /* i386 has KVA limits that the raw page counts above don't consider */
        if (kmem_map_used() > (kmem_size() * 3) / 4) {
            DTRACE_PROBE2(arc__reclaim_used, uint64_t, kmem_map_used(),
                uint64_t, (kmem_size() * 3) / 4);
            return (1);
        }
    #endif

So what do you think? The questions that spring to mind are:-
Comment Actions
Note, this may be more of a review for the code that was already committed rather than of the proposed changes. First, as I've already said in an email, I suggest removing kmem_* or at the very least placing them into a different location.

    #define freemem (vm_cnt.v_free_count + vm_cnt.v_cache_count)

could be useful in compat. This way some more of the differences from the upstream code could be eliminated.

Comment Actions
If we un-inline a couple of functions in arc_reclaim_needed() then we will see the following code:

    if (vm_cnt.v_free_count + vm_cnt.v_cache_count < zfs_arc_free_target) {
        DTRACE_PROBE2(arc__reclaim_freetarget, uint64_t,
            kmem_free_count(), uint64_t, zfs_arc_free_target);
        return (1);
    }

    /*
     * Cooperate with pagedaemon when it's time for it to scan
     * and reclaim some pages.
     */
    if (vm_cnt.v_free_count + vm_cnt.v_cache_count < vm_pageout_wakeup_thresh) {
        DTRACE_PROBE(arc__reclaim_paging);
        return (1);
    }

The two checks above can be combined as follows:

    if (vm_cnt.v_free_count + vm_cnt.v_cache_count < zfs_arc_free_target ||
        vm_cnt.v_free_count + vm_cnt.v_cache_count < vm_pageout_wakeup_thresh)
            return (1);

which is equivalent to:

    if (vm_cnt.v_free_count + vm_cnt.v_cache_count <
        MAX(zfs_arc_free_target, vm_pageout_wakeup_thresh))

Using the default value of zfs_arc_free_target:

    if (vm_cnt.v_free_count + vm_cnt.v_cache_count <
        MAX(vm_cnt.v_free_target, vm_pageout_wakeup_thresh))

Given that v_free_target is much larger than vm_pageout_wakeup_thresh on an auto-tuned system, the above reduction shows that:
Finally, I do not think that making zfs_arc_free_target configurable via sysctl actually buys us anything. Comment Actions
Actually this was already the case with the original code:

    if (kmem_map_used() > (kmem_size() * 3) / 4) {
The opposite is in fact true in my tests: before, you'd start giving up memory at 3/4 of kmem_size, which can easily be seen forcing reclaim at 40GB free on a 192GB machine.
The example that springs to mind is increasing vfs.zfs.arc_free_target while keeping vm.v_free_target the same. This can be easier to manage for some use cases across varying hardware than having to set vfs.zfs.arc_max, something we've already made use of here. Comment Actions Actually, no. Those are two different tests, because one checks physical memory and the other checks KVA.
But it was not true on my 8GB machine. In either case, dropping the KVA check on amd64 is the correct part of your changes. I do not have anything against it.
No, one can not be a substitute for the other. Comment Actions These do however have quite a tight correlation unless you're manually tuning. There was no memory pressure apart from ARC itself, and this resulted in 40GB sitting in free, which is to be expected given the previous code. Obviously this is fine if the machine is using a decent % of RAM for other things, be that other kernel allocations or user, but if not, the more RAM the higher the wastage.

Indeed it isn't a substitute, but it is an alternative which is easier to manage in some cases, and it also provides an additional way to tune ARC usage. In a system which is ARC heavy, for example, you may want to allow it to give up space more readily for other usages, e.g. spiky apps, if that's your workload.

I've knocked up the following spreadsheet to estimate the values in each case. It's clear from this that for high kmem consumers, be that ARC or other kernel use, the old 1/4 mem is always going to be the reclaim trigger. I think we all agree that said limit is of no real use on platforms that don't define UMA_MD_SMALL_ALLOC, so the question is: do we keep the current limit of v_free_target or switch to vm_pageout_wakeup_thresh?

I've added the Solaris / illumos triggers as a useful comparison point, which shows that they trigger just over the values that the new v_free_target trigger provides, which is why I believe it's a good value to use, especially so given the feedback on the initial bug report.

My concern with removing the new check and relying on vm_pageout_wakeup_thresh as the trigger is that at that point the vm is going to start cleanup of apps etc., potentially before old ARC entries which may be of less value. I'm sure this could be argued both ways, which is also potentially another reason to keep the sysctl, so we can experiment and tune easily. Either way I'd like to hear others' views on this.
As a final point, the values in that spreadsheet seem to highlight that for large memory machines the baseline vm values could potentially do with some upper limits, so we don't have over 7GB of RAM sitting there doing nothing when paging triggers, but that's a discussion for another day. Comment Actions Well, this is simply not true in general. In some cases there may be correlation, in others there is none. When memory pressure comes from userland memory allocations (anonymous memory) then KVA does not have much play. Comment Actions
I completely agree and that's why I completely support that part (KVA check) of your changes. Comment Actions
Well, my argument is that the new parameter affects the dynamic interaction between the pagedaemon and the ARC. The behavior of both is quite complex, and the interaction is not trivial either. That parameter would not set some static limit but would affect the dynamic behavior of the system.
YES!
One sentence, three distinct statements :-)
Right. On the other hand there could be inactive app memory that is actually a waste of resources and that better be sent to swap while the ARC has all the hot data that is accessed all the time.
I believe that it is impossible to hand-tune such a parameter.
Agree on both accounts. Thank you! P.S. If you get an impression that I am a bit passionate over this threshold, then it is true :-) Comment Actions Integrate changes based on feedback from all parties.
The total result of these changes in combination with the already committed updates is:
I have not included the removal of kmem_size() in favour of using physmem & vmem_size(...) at this time as that would require more testing.
Comment Actions Added missing machine/vmparam.h include, whose absence was causing all platforms to take the small-memory code paths. Comment Actions Change the default of zfs_arc_free_target to (vm_pageout_wakeup_thresh / 2) * 3. The previous default of vm_pageout_wakeup_thresh meant that the main trigger for ARC reclaim was the pagedaemon vm_lowmem event. This trigger results in aggressive ARC reclaim which can easily cause ARC to get pegged down to its minimum value. The new default also closely mirrors that of Solaris, whose comment reads "We add extra pages here to make sure the scanner doesn't start up while we're freeing memory". Also add more dtrace probes so we can better investigate ARC's reclaim and sizing decisions. Comment Actions FWIW, this line:

    zfs_arc_free_target = (vm_pageout_wakeup_thresh / 2) * 3;

makes this approach still IMHO fatally flawed. pagedaemon does other work besides paging out pages. For example, it does processing of the inactive queue. Changing the window from ~300% of the wakeup threshold to 150% of the wakeup threshold merely makes the window smaller in which ARC can get into a pathological state where it can free itself fast enough that pagedaemon never seriously processes the inactive queue. Narrowing the window but not closing it means the hole is still there. Comment Actions For context, this is the pathological failure case we hit in the pkg / ftp mirrors:

    CPU:  0.0% user,  0.0% nice,  0.4% system,  0.1% interrupt, 99.4% idle

ARC was able to free itself up fast enough to prevent pagedaemon from waking up and dealing with the growth of the inactive queue. This solved it:

    # sysctl vfs.zfs.arc_free_target=`sysctl -n vm.pageout_wakeup_thresh`
    vfs.zfs.arc_free_target: 173913 -> 56628

After a few minutes ARC was up to 5GB and performance returned.
Comment Actions Keeping it purely at vm_pageout_wakeup_thresh easily causes ARC to get pegged back to min with any real memory pressure; raising it slightly like this prevents that undesired behaviour, as it gives ARC the chance to do normal reclaim vs always triggering aggressive reclaim.
Comment Actions ..
I'm afraid any tests done with the current version in head should be totally discounted, as there are a number of issues present. Please retest with this entire patch applied and let us know what you see. Comment Actions My personal opinion is that you should already commit something to undo the damage. I would prefer that you first commit a version where the default threshold is equal to the pagedaemon's threshold as it was before. Any further research and improvement can be committed at a later time, after the necessary discussion and testing (by different parties, not only you and Karl). Comment Actions Peter, would you be happy with this with the default being vm_pageout_wakeup_thresh? I know this will definitely cause pegging of ARC at arc_min for certain loads, but if that's what people are most happy with currently, and would be happy for this to hit 10.1, then at least we'll have fixed one of the key issues. Comment Actions Back out the zfs_arc_free_target = (vm_pageout_wakeup_thresh / 2) * 3 change based on feedback. Comment Actions A couple of quick thoughts:
The churn is so bad I can't keep track of what's what anymore. I'm probably going to have to get a raw diff to compare it to r270758 or so because I can't make sense of this now. At the end of the patch, are we effectively back to this?
I think so, but it's really hard to tell with all the churn and renaming/moving. That 2-line change is basically what I was after. Comment Actions Those are changes which were missing from / moved in the upstream illumos version. They've been missed out over time; I put them in to eliminate the differences moving forward.
That's to allow dtrace to hook that function; without it we lose the fbt probe points due to compiler optimisations.
Here's a summary of the key differences:
Misc fixes which have no real functional effect:
Hope this helps. Comment Actions Its not inlining it its excluding it from the symbol table due to being static. I'll retest to see if its needed now it has a manual STD_PROBE. Comment Actions Are you sure? If what you say is true, then there would be no fbt probes for all static functions, but that is not the case clearly. Comment Actions Could you clarify, which comment? There's quite a few in there and I thought I'd covered them all ;-) Comment Actions That zfs_arc_free_target does not need to be initialized with a weird magic value. default zero initialization would be just fine. Comment Actions Just tested and your totally correct. My miss-understanding stemmed from the fact that removing static fixed it along with __used being recommended as the fix when I initially researched it, which worked. Clear removing static just confused the optimiser hence preventing it from inlining it. With the addition of the manual probe which is the real interest I'm going to remove that from the next version. Thanks for highlighting this I learned little insight into how compilers operate in this regard today :) Comment Actions Remove unused as the STD probes are sufficient and should have been noinline anyway. Initialise zfs_arc_free_target to zero. Comment Actions Sorry to keep pestering guys, but what are your thoughts now? There is still more work to be done in this area, but I believe this in its current form lays down a good foundation which we can built upon moving forward; so I'd like to get this in if there are no outstanding objections?