Make the partpopq LRU sloppy to reduce contention
ClosedPublic

Authored by jeff on Jun 23 2018, 8:54 AM.

Details

Reviewers
alc
kib
markj
Summary

This gives something like an 8x speedup in a page-fault test that is effectively single-threaded by the vm_reserv_domain_lock.

The idea is to update the partpopq list only when it is required, or when PARTPOPSLOP ticks have elapsed since the last update for this reservation. This has the effect of keeping the LRU roughly sorted.

I don't feel strongly about the right tick count; I just wanted to spread the lock acquisitions out somewhat so that every CPU didn't contend at once as the tick updated.
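
For illustration, a minimal sketch of the requeue test being described. The names here (rv->lasttick, vm_reserv_requeue(), vm_rvq_partpop, vm_reserv_domain_lock()) are assumptions drawn from this discussion, not the committed diff:

/*
 * Sketch only: assumes a per-reservation lasttick field and the
 * PARTPOPSLOP constant discussed in this review.
 */
#define	PARTPOPSLOP	1	/* ticks of permitted LRU staleness */

static void
vm_reserv_requeue(vm_reserv_t rv)
{

	/* Skip the requeue while the entry is less than PARTPOPSLOP stale. */
	if ((u_int)(ticks - rv->lasttick) < PARTPOPSLOP)
		return;
	rv->lasttick = ticks;
	vm_reserv_domain_lock(rv->domain);
	TAILQ_REMOVE(&vm_rvq_partpop[rv->domain], rv, partpopq);
	TAILQ_INSERT_TAIL(&vm_rvq_partpop[rv->domain], rv, partpopq);
	vm_reserv_domain_unlock(rv->domain);
}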

Test Plan

stress, will-it-scale.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

How random does the order of the partpopq become? Is there any way to evaluate it?

I mean, a tick is a lot, so instead of only doing it on the tick, delegate the limited (?) sorting of the rvq_partpop queues by lasttick to a daemon.

In D15977#338275, @kib wrote:

How random does the order of the partpopq become? Is there any way to evaluate it?

I mean, a tick is a lot, so instead of only doing it on the tick, delegate the limited (?) sorting of the rvq_partpop queues by lasttick to a daemon.

When I spoke with Alan he was not concerned by it. Time differences larger than a tick will still be sorted, so you only have the sub-tick ordering to worry about. Originally I had thought about using two lists and periodically appending one to the other. I think any real sorting overhead is going to require too long a lock hold time to be realistic, and you'd need a faster time counter than ticks, which begins to get expensive. We get acceptable performance with 1 tick of slop, but better with 3, and it feels more future-proof.

The other thing to consider is how accurate it needs to be, and how accurate it already is. Which thread happened to be scheduled in the last tick is just as arbitrary as this LRU. You just want to replace something that hasn't been used in a long time. I would guess we're more often looking at things that haven't been touched in seconds, or at least hundreds of milliseconds, than within a few ticks. I can measure that, but it will of course depend heavily on the workload and the amount of memory.

In D15977#338276, @jeff wrote:

The other thing to consider is how accurate it needs to be, and how accurate it already is. Which thread happened to be scheduled in the last tick is just as arbitrary as this LRU. You just want to replace something that hasn't been used in a long time. I would guess we're more often looking at things that haven't been touched in seconds, or at least hundreds of milliseconds, than within a few ticks. I can measure that, but it will of course depend heavily on the workload and the amount of memory.

If we do not need an LRU, maybe we do not need the partpopq list at all? E.g., keep bins of generations per tick, up to some limited number of bins.
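
As a purely hypothetical illustration of this binned-generations alternative (none of these names exist in the tree): reservations would be appended to the bin for their current tick, and eviction would scan forward from the oldest bin.

#define	RV_NBINS	16	/* bounded number of generation bins */

static TAILQ_HEAD(, vm_reserv) rv_bins[RV_NBINS];

static __inline int
rv_bin(int tick)
{

	/* Unsigned arithmetic keeps tick wraparound well-behaved. */
	return ((u_int)tick % RV_NBINS);
}

/*
 * Insert at rv_bins[rv_bin(ticks)]; to evict, scan from the oldest
 * bin, rv_bin(ticks + 1), forward until a non-empty bin is found.
 */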

This revision is now accepted and ready to land. Jun 23 2018, 10:30 AM
vm/vm_reserv.c
441

Don't you want the unsigned difference here? In the unlikely event that 2^31 or more ticks have elapsed since the last update, we'll fail to update the queue. Same comment below.
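
To make the failure mode concrete, here is a standalone userland illustration (not kernel code) of why the unsigned difference matters once 2^31 or more ticks have elapsed:

#include <stdio.h>

#define	PARTPOPSLOP	3

int
main(void)
{
	unsigned int lasttick = 1000;
	unsigned int now = lasttick + 0x80000000u;	/* 2^31 ticks later */
	unsigned int delta = now - lasttick;	/* well-defined in unsigned */

	/*
	 * Reinterpreted as signed (on the usual two's-complement
	 * targets) the difference looks negative, so a signed
	 * "delta >= PARTPOPSLOP" test would wrongly skip the requeue.
	 */
	printf("signed: %d, unsigned: %u\n", (int)delta, delta);
	if (delta >= PARTPOPSLOP)
		printf("unsigned comparison still updates the queue\n");
	return (0);
}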

In D15977#338278, @kib wrote:
In D15977#338276, @jeff wrote:

The other thing to consider is how accurate it needs to be, and how accurate it already is. Which thread happened to be scheduled in the last tick is just as arbitrary as this LRU. You just want to replace something that hasn't been used in a long time. I would guess we're more often looking at things that haven't been touched in seconds, or at least hundreds of milliseconds, than within a few ticks. I can measure that, but it will of course depend heavily on the workload and the amount of memory.

If we do not need an LRU, maybe we do not need the partpopq list at all? E.g., keep bins of generations per tick, up to some limited number of bins.

Yes, I thought about this, something like a calendar queue, but I think it gets complicated. This is still an LRU; it is just approximate within a few ticks, and entries further apart are still sorted. Cache eviction in your CPU is also an approximation. I think this is a good tradeoff between performance, complexity, and accuracy.

vm/vm_reserv.c
441

Yes, you're right. I had only thought about wrapping, not about incredibly long time periods.

Once (u_int) casts are added to both tick comparisons, I'm okay with this change.

Jeff, can you please close this review? This change was committed six weeks ago, but the commit didn't cite this review, so it wasn't automatically closed.