Make the partpopq LRU sloppy to reduce contention
ClosedPublic

Authored by jeff on Jun 23 2018, 8:54 AM.

Details

Reviewers
alc
kib
markj
Summary

This gives something like an 8x speedup in a page-fault test that is effectively single-threaded by the vm_reserv_domain_lock.

The idea is to update the partpopq list only when it is required, or when PARTPOPSLOP ticks have elapsed since the last update for this reservation. This has the effect of keeping the LRU roughly sorted.

I don't feel strongly about the right tick count; I just wanted to spread the lock acquisitions out somewhat so that every CPU didn't contend at once as the tick updated.
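
For illustration, a minimal sketch of the requeue test being described. The names here (rv->lasttick, vm_reserv_requeue(), vm_rvq_partpop, vm_reserv_domain_lock()) are assumptions drawn from this discussion, not the committed diff:

/*
 * Sketch only: assumes a per-reservation lasttick field and the
 * PARTPOPSLOP constant discussed in this review.
 */
#define	PARTPOPSLOP	1	/* ticks of permitted LRU staleness */

static void
vm_reserv_requeue(vm_reserv_t rv)
{

	/* Skip the requeue while the entry is less than PARTPOPSLOP stale. */
	if ((u_int)(ticks - rv->lasttick) < PARTPOPSLOP)
		return;
	rv->lasttick = ticks;
	vm_reserv_domain_lock(rv->domain);
	TAILQ_REMOVE(&vm_rvq_partpop[rv->domain], rv, partpopq);
	TAILQ_INSERT_TAIL(&vm_rvq_partpop[rv->domain], rv, partpopq);
	vm_reserv_domain_unlock(rv->domain);
}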

Test Plan

stress, will-it-scale.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

How random does the order of the partpopq become? Is there any way to evaluate it?

I mean, a tick is a lot, so instead of only doing it on the tick, delegate the limited (?) sorting of the rvq_partpop queues by lasttick to a daemon.

In D15977#338275, @kib wrote:

How random does the order of the partpopq become? Is there any way to evaluate it?

I mean, a tick is a lot, so instead of only doing it on the tick, delegate the limited (?) sorting of the rvq_partpop queues by lasttick to a daemon.

When I spoke with Alan he was not concerned by it. Time differences larger than a tick will still be sorted, so you only have the sub-tick ordering to worry about. Originally I had thought about using two lists and periodically appending one to the other. I think any real sorting overhead is going to require too long a lock hold time to be realistic, and you'd need a faster time counter than ticks, which begins to get expensive. We get acceptable performance with 1 tick of slop, but better with 3, and it feels more future-proof.

The other thing to consider is how accurate it needs to be, and how accurate it already is. Which thread happened to be scheduled in the last tick is just as arbitrary as this LRU. You just want to replace something that hasn't been used in a long time. I would guess we're more often looking at things that haven't been touched in seconds, or at least hundreds of milliseconds, than within a few ticks. I can measure that, but it will of course depend heavily on the workload and the amount of memory.

In D15977#338276, @jeff wrote:

The other thing to consider is how accurate it needs to be, and how accurate it already is. Which thread happened to be scheduled in the last tick is just as arbitrary as this LRU. You just want to replace something that hasn't been used in a long time. I would guess we're more often looking at things that haven't been touched in seconds, or at least hundreds of milliseconds, than within a few ticks. I can measure that, but it will of course depend heavily on the workload and the amount of memory.

If we do not need an LRU, maybe we do not need the partpopq list at all? E.g., keep bins of generations per tick, up to some limited number of bins.
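
As a purely hypothetical illustration of this binned-generations alternative (none of these names exist in the tree): reservations would be appended to the bin for their current tick, and eviction would scan forward from the oldest bin.

#define	RV_NBINS	16	/* bounded number of generation bins */

static TAILQ_HEAD(, vm_reserv) rv_bins[RV_NBINS];

static __inline int
rv_bin(int tick)
{

	/* Unsigned arithmetic keeps tick wraparound well-behaved. */
	return ((u_int)tick % RV_NBINS);
}

/*
 * Insert at rv_bins[rv_bin(ticks)]; to evict, scan from the oldest
 * bin, rv_bin(ticks + 1), forward until a non-empty bin is found.
 */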

This revision is now accepted and ready to land. Jun 23 2018, 10:30 AM
vm/vm_reserv.c
441

Don't you want the unsigned difference here? In the unlikely event that 2^31 or more ticks have elapsed since the last update, we'll fail to update the queue. Same comment below.
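
To make the failure mode concrete, here is a standalone userland illustration (not kernel code) of why the unsigned difference matters once 2^31 or more ticks have elapsed:

#include <stdio.h>

#define	PARTPOPSLOP	3

int
main(void)
{
	unsigned int lasttick = 1000;
	unsigned int now = lasttick + 0x80000000u;	/* 2^31 ticks later */
	unsigned int delta = now - lasttick;	/* well-defined in unsigned */

	/*
	 * Reinterpreted as signed (on the usual two's-complement
	 * targets) the difference looks negative, so a signed
	 * "delta >= PARTPOPSLOP" test would wrongly skip the requeue.
	 */
	printf("signed: %d, unsigned: %u\n", (int)delta, delta);
	if (delta >= PARTPOPSLOP)
		printf("unsigned comparison still updates the queue\n");
	return (0);
}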

In D15977#338278, @kib wrote:
In D15977#338276, @jeff wrote:

The other thing to consider is how accurate it needs to be, and how accurate it already is. Which thread happened to be scheduled in the last tick is just as arbitrary as this LRU. You just want to replace something that hasn't been used in a long time. I would guess we're more often looking at things that haven't been touched in seconds, or at least hundreds of milliseconds, than within a few ticks. I can measure that, but it will of course depend heavily on the workload and the amount of memory.

If we do not need an LRU, maybe we do not need the partpopq list at all? E.g., keep bins of generations per tick, up to some limited number of bins.

Yes, I thought about this, something like a calendar queue, but I think it gets complicated. This is still an LRU; it is just approximate within a few ticks, and entries further apart are still sorted. Cache eviction in your CPU is also an approximation. I think this is a good tradeoff between performance, complexity, and accuracy.

vm/vm_reserv.c
441

Yes, you're right. I had only thought about wrapping, not about incredibly long time periods.

Once (u_int) casts are added to both tick comparisons, I'm okay with this change.

Jeff, can you please close this review? This change was committed six weeks ago, but the commit didn't cite this review, so it wasn't automatically closed.