
pass fault type to vm_pager_get_pages
Abandoned (Public)

Authored by kmacy on May 28 2016, 10:40 PM.

Details

Reviewers
kib
alc
markj
Summary

Currently vm_pager_get_pages does not pass the fault type down to the device pager's fault handler. Consequently, on FreeBSD the driver has to assume that every fault is a write fault for the sake of correctness. This adds a lot of additional overhead.

I'm less certain about this patch, so ... try to give me the benefit of the doubt.

Test Plan

None of the existing code uses the extra argument, so this should have no impact.

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline

kmacy updated this revision to Diff 17048.May 28 2016, 10:40 PM
kmacy retitled this revision from to pass fault type to vm_pager_get_pages.
kmacy updated this object.
kmacy edited the test plan for this revision. (Show Details)
kmacy added reviewers: kib, alc, markj.
kmacy set the repository for this revision to rS FreeBSD src repository.
kmacy added a project: multimedia.
kmacy added subscribers: emaste, scottl.
alc added inline comments.May 28 2016, 10:51 PM
sys/vm/vm_fault.c
1152

Please understand that this function only creates read and/or execute mappings. If it did otherwise, vm_fault_prefault() would break copy-on-write.

kmacy added inline comments.May 28 2016, 11:00 PM
sys/vm/vm_fault.c
1152

Pages of device memory are not fungible, so COW obviously does not apply here. I added the PG_FICTITIOUS check as a proxy for !COW. Is there some easy way to convince the code to prefault writable mappings of PG_FICTITIOUS pages? Do we need to add a flag to the object to indicate that COW is not applicable, or is PG_FICTITIOUS good enough?

This patch doesn't include the necessary updates to md and tmpfs. If there's any chance of it going in as is, I will update the patch.

kib edited edge metadata.May 29 2016, 11:28 AM

I revived my patch; it had grown several mis-merges in the time since it was written.

The main differences from yours are the use of VM_PROT_ALL in the other places and keeping the VOP_GETPAGES() interface. I do not see it as useful to pass the fault type to the VOP.

https://kib.kiev.ua/kib/cdev_prot.1.patch

kmacy added a comment.May 29 2016, 7:29 PM
In D6620#140020, @kib wrote:

I revived my patch; it had grown several mis-merges in the time since it was written.
The main differences from yours are the use of VM_PROT_ALL in the other places and keeping the VOP_GETPAGES() interface. I do not see it as useful to pass the fault type to the VOP.
https://kib.kiev.ua/kib/cdev_prot.1.patch

Thanks. I'll re-start with that.

It occurs to me that if I can identify the virtual address at which the object starts, I don't actually need to change the device pager ABI beyond adding a new OBJ type to indicate that the fault routine will manage the pglist itself. If the fault routine knows the actual base at which the object is mapped, it can insert the additional mappings itself. Additionally, it means the VM doesn't need to learn that COW is not relevant to device memory.


It looks like there's no clean way to work backwards from an object to a map_entry. Admittedly, that would be a bit of a layering violation and would cause problems for aliasing. Perhaps rather than passing in count and *rahead, we could pass in the actual fault address?


I can find the entry by searching the map for the object at fault time. This gives me the start of the mapping. I can then cache that start, and cache the map address to look up a new entry if a second process maps the device. In the instances where the driver's fault routine returns more than one page, I can simply call pmap_enter_object after all the pages have been added to the object.

In summary, the only ABI change I need is to permit the fault routine to manage devp_pglist (along with passing down the actual fault type).

kmacy updated this revision to Diff 17087.May 30 2016, 1:59 AM
kmacy edited edge metadata.

Switch to kib's patch, dropping the changes to VOP_GETPAGES

kib added a comment.May 30 2016, 2:50 PM
In D6620#140142, @kmacy wrote:

I can find the entry by searching the map for the object at fault time. This gives me the start of the mapping. I can then cache that start, and cache the map address to look up a new entry if a second process maps the device. In the instances where the driver's fault routine returns more than one page, I can simply call pmap_enter_object after all the pages have been added to the object.
In summary, the only ABI change I need is to permit the fault routine to manage devp_pglist (along with passing down the actual fault type).

Passing the mapping address is a huge layering violation. Digging through the curproc map to find the faulting address is an even larger one. What would happen if there were already another mapping in the same map for the same object?

I do not think that these prefaulting changes are required for the driver to work. I suggest putting them aside for now, until the discussion of the cdevsw pager KPI goes forward.

kmacy added a comment.EditedMay 30 2016, 4:33 PM

Passing the mapping address is a huge layering violation. Digging through the curproc map to find the faulting address is an even larger one. What would happen if there were already another mapping in the same map for the same object?

You would never have the same aperture mapped twice, and even if you did, it would not have any adverse effect. We really need to stop conflating device memory's behavior with that of RAM.

kib added a comment.May 30 2016, 4:56 PM
In D6620#140319, @kmacy wrote:

Passing the mapping address is a huge layering violation. Digging through the curproc map to find the faulting address is an even larger one. What would happen if there were already another mapping in the same map for the same object?

You would never have the same aperture mapped twice, and even if you did, it would not have any adverse effect. We really need to stop conflating device memory's behavior with that of RAM.

If by aperture you mean the mappable portion of the GTT, I in fact do not see why your statement is true. Moreover, from what I remember from the time when I had to work on that code, it can in fact be mapped more than once.

That said, I do not see what makes device memory so special that it justifies breaking the VM architecture.

kmacy added a comment.May 30 2016, 4:56 PM
This comment was removed by kmacy.
kmacy added a comment.May 30 2016, 5:01 PM
This comment was removed by kmacy.
kmacy added a comment.May 30 2016, 5:55 PM

If by aperture you mean the mappable portion of the GTT, I in fact do not see why your statement is true. Moreover, from what I remember from the time when I had to work on that code, it can in fact be mapped more than once.

Will d_mmap_single use the same object multiple times? I don't think so. A new object should be allocated each time it is called. However, if it does, that breaks key assumptions in the Linux code.

That said, I do not see what makes device memory so special that it justifies breaking the VM architecture.

Please look at what the modern Linux driver does. As far as I can tell, this behavior is not supported by the FreeBSD VM. Having vm_fault be called 1000+ times when it should only be called once is not an acceptable solution. Currently the only way to make this work on FreeBSD is for the linuxkpi to parse the process's map. Thanks for any input.

Can we commit this patch of yours?

kmacy added a comment.May 30 2016, 7:59 PM

This is the core of the i915_gem_fault routine. You will notice how it calls vm_insert_pfn for every page in the view, which in my testing can be as many as 1030+ pages.

One thing I was wrong about: it does always use vma_prot, which is equivalent to the entry's prot. So my prior change to prefault, passing the fault type, was clearly incorrect.

	/* Try to flush the object off the GPU first without holding the lock.
	 * Upon reacquiring the lock, we will perform our sanity checks and then
	 * repeat the flush holding the lock in the normal manner to catch cases
	 * where we are gazumped.
	 */
	ret = i915_gem_object_wait_rendering__nonblocking(obj, NULL, !write);
	if (ret)
		goto unlock;

	/* Access to snoopable pages through the GTT is incoherent. */
	if (obj->cache_level != I915_CACHE_NONE && !HAS_LLC(dev)) {
		ret = -EFAULT;
		goto unlock;
	}

	/* Use a partial view if the object is bigger than the aperture. */
	if (obj->base.size >= dev_priv->gtt.mappable_end &&
	    obj->tiling_mode == I915_TILING_NONE) {
		static const unsigned int chunk_size = 256; // 1 MiB

		memset(&view, 0, sizeof(view));
		view.type = I915_GGTT_VIEW_PARTIAL;
		view.params.partial.offset = rounddown(page_offset, chunk_size);
		view.params.partial.size =
			min_t(unsigned int,
			      chunk_size,
			      (vma->vm_end - vma->vm_start)/PAGE_SIZE -
			      view.params.partial.offset);
	}

	/* Now pin it into the GTT if needed */
	ret = i915_gem_object_ggtt_pin(obj, &view, 0, PIN_MAPPABLE);
	if (ret)
		goto unlock;

	ret = i915_gem_object_set_to_gtt_domain(obj, write);
	if (ret)
		goto unpin;

	ret = i915_gem_object_get_fence(obj);
	if (ret)
		goto unpin;

	/* Finally, remap it using the new GTT offset */
	pfn = dev_priv->gtt.mappable_base +
		i915_gem_obj_ggtt_offset_view(obj, &view);
	pfn >>= PAGE_SHIFT;

	if (unlikely(view.type == I915_GGTT_VIEW_PARTIAL)) {
		/* Overriding existing pages in partial view does not cause
		 * us any trouble as TLBs are still valid because the fault
		 * is due to userspace losing part of the mapping or never
		 * having accessed it before (at this partials' range).
		 */
		unsigned long base = vma->vm_start +
				     (view.params.partial.offset << PAGE_SHIFT);
		unsigned int i;

		for (i = 0; i < view.params.partial.size; i++) {
			ret = vm_insert_pfn(vma, base + i * PAGE_SIZE, pfn + i);
			if (ret)
				break;
		}

		obj->fault_mappable = true;
	} else {
		if (!obj->fault_mappable) {
			unsigned long size = min_t(unsigned long,
						   vma->vm_end - vma->vm_start,
						   obj->base.size);
			int i;


			if (size >> PAGE_SHIFT > 1)
				atomic_add_long(&gem_possible_prefaults, (size >> PAGE_SHIFT) - 1);
			
			for (i = 0; i < size >> PAGE_SHIFT; i++) {
				ret = vm_insert_pfn(vma,
						    (unsigned long)vma->vm_start + i * PAGE_SIZE,
						    pfn + i);
				if (ret)
					break;
			}

			obj->fault_mappable = true;
		} else
			ret = vm_insert_pfn(vma,
					    (unsigned long)vmf->virtual_address,
					    pfn + page_offset);
	}

vm_insert_pfn does not apply to COW mappings and is only used by device fault routines:

/**
 * vm_insert_pfn - insert single pfn into user vma
 * @vma: user vma to map to
 * @addr: target user address of this page
 * @pfn: source kernel pfn
 *
 * Similar to vm_insert_page, this allows drivers to insert individual pages
 * they've allocated into a user vma. Same comments apply.
 *
 * This function should only be called from a vm_ops->fault handler, and
 * in that case the handler should return NULL.
 *
 * vma cannot be a COW mapping.
 *
 * As this is called only for pages that do not currently exist, we
 * do not need to flush old virtual caches or the TLB.
 */
int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
			unsigned long pfn)
{
	return vm_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot);
}

It ultimately filters down into insert_pfn():

static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
			pfn_t pfn, pgprot_t prot)
{
	struct mm_struct *mm = vma->vm_mm;
	int retval;
	pte_t *pte, entry;
	spinlock_t *ptl;

	retval = -ENOMEM;
	pte = get_locked_pte(mm, addr, &ptl);
	if (!pte)
		goto out;
	retval = -EBUSY;
	if (!pte_none(*pte))
		goto out_unlock;

	/* Ok, finally just insert the thing.. */
	if (pfn_t_devmap(pfn))
		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
	else
		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
	set_pte_at(mm, addr, pte, entry);
	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */

	retval = 0;
out_unlock:
	pte_unmap_unlock(pte, ptl);
out:
	return retval;
}

pfn_t_devmap is just a check against the PFN_DEV|PFN_MAP flags (PFN_DEV being, by the way, a much more rational name than FreeBSD's rather suspect PG_FICTITIOUS).

static inline bool pfn_t_devmap(pfn_t pfn)
{
	const u64 flags = PFN_DEV|PFN_MAP;

	return (pfn.val & flags) == flags;
}

I think the rest of the code is self-explanatory.

kib added a comment.May 30 2016, 8:34 PM
In D6620#140367, @kmacy wrote:

Why can you not just install all the pages into the object queue at once? I do not understand why you want to also force-map them into userspace simultaneously. The difference would indeed be in soft faults. But soft faults in vm_fault() would only result in looking up the page and installing the pte, without even touching the pager. Moreover, I believe (but have not verified) that a soft fault would be handled by the vm_fault() 'fast path', which allows simultaneous fault handling (the object is read-locked).

Changing the interface to allow the elimination of soft faults is a completely different and more complicated story. Doing what you said earlier, just throwing the FreeBSD VM out the window because it is not Linux, is something I cannot support.

kmacy added a comment.EditedMay 31 2016, 8:29 PM
In D6620#140369, @kib wrote:
In D6620#140367, @kmacy wrote:

Why can you not just install all the pages into the object queue at once?

If a range is mapped more than once, the page will be migrating between objects. This just adds complexity and means that soft faults won't necessarily work in those cases.

I do not understand why you want to also force-map them into userspace simultaneously. The difference would indeed be in soft faults. But soft faults in vm_fault() would only result in looking up the page and installing the pte, without even touching the pager. Moreover, I believe (but have not verified) that a soft fault would be handled by the vm_fault() 'fast path', which allows simultaneous fault handling (the object is read-locked).

I don't understand why 1000x more traps into the kernel is a win. This sounds like deference to the existing model for its own sake, as opposed to any real value that it adds.

Changing the interface to allow the elimination of soft faults is a completely different and more complicated story. Doing what you said earlier, just throwing the FreeBSD VM out the window because it is not Linux, is something I cannot support.

This is all happening out of tree, in essentially third-party code. I'm not asking you to support anything other than committing _your_ patch. Now, CAN WE PLEASE move on to that?

kmacy added a comment.May 31 2016, 8:47 PM

@kib To clarify: I only need one change to the VM, and that is provided by your patch. I abandoned the other revision changing the device pager interface. Everything else is handled out of tree in the (soon-to-be) ports linuxkpi.

alc edited edge metadata.Dec 11 2016, 5:00 AM

Related change in rS309710, rS309712

Just so we're all on the same page, those commits are not only related, but they replace this change. We should probably close this change to avoid any future confusion.

kmacy abandoned this revision.Dec 11 2016, 5:14 AM

Replaced by populate which, although it still insists on doing a great deal of accounting not needed for device mappings (COW cargo-culting, AFAICT) and complicates mapping the same device memory at different VAs, is quite close to what graphics drivers expect. Thanks go to Kostik and Alan.