Page MenuHomeFreeBSD

tmpfs: for used pages, account really allocated pages, instead of file sizes
ClosedPublic

Authored by kib on Oct 23 2022, 11:41 AM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Jan 9, 6:36 AM
Unknown Object (File)
Sat, Jan 7, 7:45 AM
Unknown Object (File)
Dec 16 2022, 7:42 AM
Unknown Object (File)
Dec 14 2022, 3:39 AM
Unknown Object (File)
Dec 11 2022, 3:34 PM
Unknown Object (File)
Dec 9 2022, 12:18 PM
Unknown Object (File)
Nov 30 2022, 10:16 PM
Subscribers

Details

Summary
tmpfs: correct reported st_blocks/va_bytes

Previously the reported value did not accounted for the swapped out
pages.
uiomove_object: hide diagnostic under bootverbose
tmpfs: for used pages, account really allocated pages, instead of file sizes
This makes tmpfs size accounting correct for the sparce files.

PR:     223015
vm_pager: add method to veto page allocation
vm_pager: add methods for page insertion and removal notifications
tmpfs: make vm_object point to the tmpfs node instead of vnode

The vnode could be reclaimed and allocated again during the lifecycle of
the node, but the node cannot.  Also, referencing the node would allow
to reach it and tmpfs mount data from the object, regardless of the
state of the possibly absent vnode.

Still use swp_tmpfs for back-pointer, instead of using handle. Use of
named swap objects would incur taking the sw_alloc_sx on node allocation
and deallocation.

swp_tmpfs is renamed to swp_priv to remove the last bit of tmpfs in vm/.
Add 'show tmpfs' ddb command

Tested by: pho

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

kib requested review of this revision.Oct 23 2022, 11:41 AM
kib edited the summary of this revision. (Show Details)

correct reported st_blocks/va_bytes

I somewhat dislike the extra branch and memory accesses in vm_page_{insert,remove,alloc}(). The facts that vm_page_grab_valid() needs to re-check vm_pager_can_alloc() after a page allocation failure, and tmpfs has to handle unlinking itself (i.e., vm_object_terminate() does not handle accounting for you) also suggests that the abstraction is not quite right.

sys/fs/tmpfs/tmpfs_vfsops.c
47 ↗(On Diff #112137)

opt_ddb.h should be included first.

705 ↗(On Diff #112137)
sys/fs/tmpfs/tmpfs_vnops.c
475 ↗(On Diff #112137)

Why can't it use obj->resident_page_count?

kib marked 3 inline comments as done.Oct 24 2022, 3:33 PM

I somewhat dislike the extra branch and memory accesses in vm_page_{insert,remove,alloc}(). The facts that vm_page_grab_valid() needs to re-check vm_pager_can_alloc() after a page allocation failure, and tmpfs has to handle unlinking itself (i.e., vm_object_terminate() does not handle accounting for you) also suggests that the abstraction is not quite right.

Well. Rechecks in vm_page_grab_valid() and vm_fault_allocate() are there because vm_page_alloc() cannot properly report the cause for the failed allocation. It assumes that any failure is retry-able.
On the other hand, vm_object_terminate() is micro-optimized to not remove individual pages from the queue, that is the layering violation IMO. But it is there, and I have to work around it. BTW, vm_pager_deallocate() is too late in vm_object_terminate() as well, and I believe it is at least second time when it was too late (see OBJ_PG_DTOR).

I decided it is too much to change vm_page_alloc() interface or use OBJ_PG_DTOR there. One version of the patch added pager callback right before pages were removed, but then I realized that it is simpler to adjust the counter from single place of node destructor.

sys/fs/tmpfs/tmpfs_vnops.c
475 ↗(On Diff #112137)

Because some pages could be swapped out.

kib marked an inline comment as done.

Eliminate swap_pager_total_pages(), tn_reg.tn_pages is faster and should give the same number.

In D37097#842579, @kib wrote:

I somewhat dislike the extra branch and memory accesses in vm_page_{insert,remove,alloc}(). The facts that vm_page_grab_valid() needs to re-check vm_pager_can_alloc() after a page allocation failure, and tmpfs has to handle unlinking itself (i.e., vm_object_terminate() does not handle accounting for you) also suggests that the abstraction is not quite right.

Well. Rechecks in vm_page_grab_valid() and vm_fault_allocate() are there because vm_page_alloc() cannot properly report the cause for the failed allocation. It assumes that any failure is retry-able.
On the other hand, vm_object_terminate() is micro-optimized to not remove individual pages from the queue, that is the layering violation IMO. But it is there, and I have to work around it. BTW, vm_pager_deallocate() is too late in vm_object_terminate() as well, and I believe it is at least second time when it was too late (see OBJ_PG_DTOR).

I decided it is too much to change vm_page_alloc() interface or use OBJ_PG_DTOR there. One version of the patch added pager callback right before pages were removed, but then I realized that it is simpler to adjust the counter from single place of node destructor.

Really my concern is that tmpfs is now doing too much second-guessing of the VM system.

Rather than counting pages, can we instead track the total sizes of holes (i.e., no resident pages or swap space is allocated) in each file? Holes are created by tmpfs_truncate() and tmpfs_deallocate(), and filled in by tmpfs_write() (really, in uiomove_object()) and getpages (currently not intercepted by tmpfs, but it could be). So, tmpfs might be able to perform its accounting more independently.

sys/fs/tmpfs/tmpfs_subr.c
252 ↗(On Diff #112141)

Suppose I write 4096 bytes to a new tmpfs file, so tn_pages == 1 and the page is dirty. Suppose the page is laundered, so a copy exists on the swap device. Suppose that the page is not freed after this point (but it remains clean). Then I truncate the file to 0 bytes. vm_object_page_remove() will free the page, but here we will not decrement tn_pages because the pager still has a copy. Then vm_object_page_remove() frees swap space in the truncated region, but nothing will decrement tn_pages, I think.

726 ↗(On Diff #112141)

Shouldn't this be a subtraction?

sys/fs/tmpfs/tmpfs_vnops.c
475 ↗(On Diff #112137)

Sorry, I didn't look carefully at the commit descriptions. But, how do we handle the case where a page is resident and clean, and there is a copy on swap? I think such a page must be counted twice, since the tmpfs mount size includes consumed swap space, but tmpfs does not get any notification from putpages.

kib marked 2 inline comments as done.Oct 26 2022, 10:17 PM

Rather than counting pages, can we instead track the total sizes of holes (i.e., no resident pages or swap space is allocated) in each file? Holes are created by tmpfs_truncate() and tmpfs_deallocate(), and filled in by tmpfs_write() (really, in uiomove_object()) and getpages (currently not intercepted by tmpfs, but it could be). So, tmpfs might be able to perform its accounting more independently.

Doing something along these lines was my first attempt (even before SEEK_HOLE patch). I decided that it is impossible to do, or rather, I still need a hook into vm_page_grab_*(). Problem is that uiomove_object() cannot determine if it is filling hole or writing to the existing page. So vm_page_grab_valid() must report it, and then the patch becomes half of the current patch, but uglier. Then we need a hook into vm_fault(), and basically we end up with what I presented in this review.

sys/fs/tmpfs/tmpfs_subr.c
252 ↗(On Diff #112141)

So it must be handled in tmpfs_pager_freespace(), thank you for noting this.

726 ↗(On Diff #112141)

Of course, this was non-tested late minute change to switch to tn_pages.

kib marked 2 inline comments as done.

Fix '-' in tmpfs_free_node().
Account for the freed swap space in tmpfs_pager_freespace()

tmpfs_pager_freespace() must charge both tmpfs_mount and tmpfs_node

Fix newly introduced tmpfs_pager_freespace(): it should count swap pages, not page queue (which is already cleared ATM).

Is it a bug that tmpfs doesn't call swap_reserve_by_cred() when extending a file?

sys/fs/tmpfs/tmpfs_subr.c
252 ↗(On Diff #112141)

I'm still confused about the scenario where we have a resident page and a copy exists in the pager. Should we count it once or twice? For tmpfs, it makes more sense to count it twice I think. But then the vm_pager_has_page() checks in tmpfs_page_inserted() and tmpfs_page_removed() should be removed, and tmpfs needs to intercept pageout ops and increment the count if a new swap block is to be allocated. And possibly putpages should return an error if the mount size limit would be exceeded.

If we count such a page just once, however, then tmpfs_pager_freespace() should only decrement the count if there is no resident copy of the page. Otherwise:

  1. Start with sparse file of size PAGE_SIZE (tn_pages == 0)
  2. Write 4KB to the file (tn_pages == 1)
  3. The resident page is written to the pager (tn_pages == 1)
  4. The page is dirtied again, so freespace is called at some point (tn_pages == 0)
  5. The file has a resident page but tn_pages == 0
sys/fs/tmpfs/tmpfs_vnops.c
475 ↗(On Diff #112244)

Should shm_stat() similarly avoid reporting holes?

476 ↗(On Diff #112244)

Is there any harm in a racy (i.e., unlocked) read?

sys/vm/vm_page.c
1486 ↗(On Diff #112244)

We should call this pager op in vm_page_rename() as well.

kib marked 5 inline comments as done.Nov 21 2022, 7:42 PM

Is it a bug that tmpfs doesn't call swap_reserve_by_cred() when extending a file?

Yes, it is. But tmpfs is consistent with this bug, the backing object is created with NULL credentials.

sys/fs/tmpfs/tmpfs_subr.c
252 ↗(On Diff #112141)

I believe that we should only count the page once in tn_pages, if it exists both in swap and in queue. tn_pages counts all page-sized ranges in the tmpfs node that are not holes.

I agree that there is a bug in swap_pager_freespace() then, should be fixed.

sys/fs/tmpfs/tmpfs_vnops.c
475 ↗(On Diff #112244)

Do you mean s/avoid/start/? I.e., make stat(2) on shmfd report blocks as really allocated pages. In principle this would be a useful addition, which I can do after we agree on this patch.

476 ↗(On Diff #112244)

Unlocked read on 32bit machines would give potentially torn value. I am not sure that it worth #ifdef-ing ILP32 there. I can.

475 ↗(On Diff #112137)

No, I do not think that it should be counted twice. tn_pages must be equal to the file size - holes size.

kib marked 4 inline comments as done.

Notify about page insertion in vm_page_rename(), which inlines vm_page_insert().
Do not report freed swap pages in swap_pager_meta_free(), if the valid page still exists in the queue.

shmfd: account for the actually allocated pages

Return the value as stat(2) st_blocks.

markj added inline comments.
sys/fs/tmpfs/tmpfs_vnops.c
476 ↗(On Diff #112244)

I suspect it is a common-enough operation that removing the locking for !ILP32 is worthwhile.

usr.bin/posixshmcontrol/posixshmcontrol.c
407 ↗(On Diff #113464)
448 ↗(On Diff #113464)

It looks like this test is inverted. error == -1 && errno == ENOTTY means that it's not a largepage object.

This revision is now accepted and ready to land.Nov 30 2022, 10:43 PM
kib marked 3 inline comments as done.

Fix sillyness in posixshmcontrol.
In tmpfs_stat(), only lock obj for 32bit.

This revision now requires review to proceed.Nov 30 2022, 11:22 PM
This revision is now accepted and ready to land.Nov 30 2022, 11:58 PM
vm_pager_allocate(): override resulting object type

For dynamically allocated pager type, which inherits the parent's alloc
method, type of the returned object is set to the parent's type
otherwise.

Also add a naive test for the shm pages accounting.

This revision now requires review to proceed.Dec 4 2022, 12:49 AM