Solaris doors IPC implementation
Needs ReviewPublic
Actions

Authored by bnovkov on Jan 30 2022, 12:08 PM.

Details

Reviewers

Group Reviewers

Summary

This differential adds Solaris doors IPC support to the kernel for amd64 architectures. Solaris doors are a form
of lightweight inter-process communication which allows a thread to invoke a function
from another process, essentially mimicking a local remote procedure call.

A serving process may expose its service by creating a door that should then be attached to a file.
A client may then invoke the exposed procedure using a "door call" operation on the attached file, during
which the kernel chooses a thread in the serving process which will execute the procedure. The serving
thread will invoke a "door return" operation upon returning which will transfer any results to the calling thread.

The kernel maintains a list of doors for each process and a per-process door server thread pool, with the
possibility of each door having its own private thread pool. Currently, all door-related operations are done
via the "door" syscall which wraps several syscalls using subcodes.

When a door is created, the basic library spawns a new thread that invokes the "door_return" syscall.
This thread is then placed on a door thread pool and waits indefinitely.
When a "door call" syscall is invoked, the kernel picks a server thread from a thread pool and transfers any
arguments provided by the calling thread to the serving thread. Small amounts of data are simply copied,
while larger arguments are transferred using shared mappings of argument data. The calling thread then waits for results.
The selected server thread gets woken up and its frame is adjusted so the procedure attached to the invoked door gets executed
when we return to userspace from "door_return". When the procedure is finished, the thread invokes "door_return" once again, and
any results are transferred to the calling thread, which is then unblocked and returns to userspace.

Currently, doors are attached to an existing file by "latching" onto its vnode, replacing its vnops vector with
the "doorfs_vnops" vector. A new vnode, file type and vnops vector was added to achieve this
(VDOOR, DTYPE_DOOR, and doorfs_vnops respectively). The owning process may attach or detach the
door to or from a regular file using "fattach" and "fdetach", which are implemented as door syscalls.

Currently, the only unresolved issue is that there is no handling of thread termination due to signals
which can lead to kernel memory leaks.

Test Plan

Tested using a basic library with a suite of tests available at this git repo. Further testing would consist of porting software which uses doors IPC from Illumos.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

bnovkov created this revision.Jan 30 2022, 12:08 PM

Herald added a reviewer: brooks. · View Herald TranscriptJan 30 2022, 12:08 PM

Herald added a subscriber: imp. · View Herald Transcript

bnovkov requested review of this revision.Jan 30 2022, 12:08 PM

khng added a subscriber: khng.Jan 30 2022, 12:18 PM

arrowd added a subscriber: arrowd.Jan 30 2022, 12:31 PM

Can you explain why do we need this, at all?

[I stopped reading the patch after my error counter overflown]

sys/fs/doorfs/doorfs_vnops.c
112	Why do you look vnode interlock there?
172	I think this code is never executed at all.
174	Why?
sys/kern/kern_door.c
55	You reference to fp is invalid at this moment.
58	You cannot access f_vnode for random struct file. It is valid only for specific types.
84	Why do you think that it is good idea to use scheduler lock to sync anything related to doors why this assignment needs the lock how does it work if you check for td_door == NULL is not locked?
291	You must not create vnode without assigned mount point.
341	At this moment obj is invalid.
345	Suppose that you have two consecutive mappings, both consisting of one page. Next, suppose that two-page door_pages_lookup() request is issued. What would be the result there? Not to mention issues that you do not handle shadowing or non-resident pages.

Removed redundant locks and a few errors.

First off, thank you for taking the time to take a look at the patch!
This is my first patch and I greatly appreciate all feedback as I am overwhelmed by the vast complexity.
I am sure that there are more dubious lines of code lurking in the patch and I greatly appreciate all suggestions to rectify them.

As for the motivation, the project idea was listed on the IdeasPage which prompted me to write the patch in the first place.
This would make porting Illumos software which uses doors IPC possible and would also provide a fast and lightweight IPC mechanism.
For example, doors could be used instead of unix sockets for local communication.

sys/fs/doorfs/doorfs_vnops.c
112	Another redundant lock, this will be removed.
sys/kern/kern_door.c
84	You are right, that lock is completely redundant and will be removed.
291	Can you suggest a fix or point me in in the right direction?
345	I would assume that the function would return an error. Yes, I have not considered shadowing simply due to the fact that I was overwhelmed and could not find an appropriate solution. Do you have any suggestions how to fix this?

Removed unused includes in door_machdep.c

In D34097#771069, @bojan.novkovic_fer.hr wrote:

First off, thank you for taking the time to take a look at the patch!
This is my first patch and I greatly appreciate all feedback as I am overwhelmed by the vast complexity.
I am sure that there are more dubious lines of code lurking in the patch and I greatly appreciate all suggestions to rectify them.

As for the motivation, the project idea was listed on the IdeasPage which prompted me to write the patch in the first place.
This would make porting Illumos software which uses doors IPC possible and would also provide a fast and lightweight IPC mechanism.
For example, doors could be used instead of unix sockets for local communication.

Yes, proper implementation of this beast requires a lot of stuff. So you need to do some architectural work first, instead of hoping that ad-hock approach for hacking a lot of core structures would work.

For instance, AFAIR, Solaris has a concept of mounting over single vnode, which we don't, and which was used for doors. Do we want the same approach? If not, what is the right place to put doors-related data? This clearly depends on the desired lifecycle of the data, which needs to be defined explicitly.

Adding new vnode type has too much consequences for it to be worth the end goal. The fact that you have to somehow patch cd9660 and tmpfs is already a red flag.

How much of this code needs to live in kernel? Does the thread pool really have to be handled by kernel and not by userspace?
Another related question, why this cannot be pure userspace library? (I do not object against having some support code in kernel, but I believe it needs to be proven to be needed). Might be, some minimal kernel support would be enough, like an addition to umtx(2) operations, with the rest handled by userspace.

What should be the semantic of doors on exec? You clear all doors on fork, which is strange for fork, but leave it as is for exec.

How the server pool should react to the process stop/suspend requests?

Is it possible to make all of this non-intrusive, e.g. by containing all this code in a module? I do not mind if you need a pointer there and here, e.g. a scratchpad in struct proc, but scattering it all over the kernel, for an experiment, is too worrisome. Also see above about userspace implementation.

It seems you missed the MD files (amd64) in the patch.

sys/kern/kern_door.c
84	Can td be not curthread there? If not, you must assert that it is, but the question is, why pass td at all? If it can be, then you have an obvious race when two threads try to act on the same td and either you get a memory leak, or worse, a memory corruption from this, if operation happens in right (wrong) order.
291	I suspect that you need to define the architecture first. Your reclaim method (which is probably never called) is the litmus indicator of the issues you have . For instance, how should your code react to reclamation of the underlying ('prev' vnode in your terminology) ? I suspect that your door info must be cleared, but how would you know about the situation?
345	No, the function would not return an error, instead you operate on unrelated pages. The right way is probably vm_fault_quick_hold_pages(), which also would take care of your bugs with dereferencing random pointers after vm_map unlock. But I am not sure about required semantic so I cannot say definitely.
1619	So you removed the lock in the line I pointed out but left it everywhere else.

In D34097#771081, @kib wrote:

In D34097#771069, @bojan.novkovic_fer.hr wrote:

First off, thank you for taking the time to take a look at the patch!
This is my first patch and I greatly appreciate all feedback as I am overwhelmed by the vast complexity.
I am sure that there are more dubious lines of code lurking in the patch and I greatly appreciate all suggestions to rectify them.

As for the motivation, the project idea was listed on the IdeasPage which prompted me to write the patch in the first place.
This would make porting Illumos software which uses doors IPC possible and would also provide a fast and lightweight IPC mechanism.
For example, doors could be used instead of unix sockets for local communication.

Yes, proper implementation of this beast requires a lot of stuff. So you need to do some architectural work first, instead of hoping that ad-hock approach for hacking a lot of core structures would work.

For instance, AFAIR, Solaris has a concept of mounting over single vnode, which we don't, and which was used for doors. Do we want the same approach? If not, what is the right place to put doors-related data? This clearly depends on the desired lifecycle of the data, which needs to be defined explicitly.

Adding new vnode type has too much consequences for it to be worth the end goal. The fact that you have to somehow patch cd9660 and tmpfs is already a red flag.

How much of this code needs to live in kernel? Does the thread pool really have to be handled by kernel and not by userspace?
Another related question, why this cannot be pure userspace library? (I do not object against having some support code in kernel, but I believe it needs to be proven to be needed). Might be, some minimal kernel support would be enough, like an addition to umtx(2) operations, with the rest handled by userspace.

What should be the semantic of doors on exec? You clear all doors on fork, which is strange for fork, but leave it as is for exec.

How the server pool should react to the process stop/suspend requests?

Is it possible to make all of this non-intrusive, e.g. by containing all this code in a module? I do not mind if you need a pointer there and here, e.g. a scratchpad in struct proc, but scattering it all over the kernel, for an experiment, is too worrisome. Also see above about userspace implementation.

It seems you missed the MD files (amd64) in the patch.

I think that there may have been some misunderstandings regarding the architectural decisions in the patch and a general lack of a clearer picture, mostly due to the meager amount of information I provided and the sheer
size of the patch, so I will try to lay out and elaborate several key moments.

First off, the architecture itself is by no means ad-hoc or created on a whim and is instead a result of a detailed study of the Doors IPC mechanism. The architecture is very similar to the one in Illumos [1] (which was used as a point of reference throughout the whole patch, where applicable)
and is very self-contained (if we ignore the vnode-related stuff, the patch adds two variables to the proc and thread structures, modifies fork and exit to ensure door initialization and release and the rest of the mechanism resides entirely within kern_door.c).
It was designed to be an Illumos-compatible cleanroom implementation, not a carbon copy of the Illumos implementation.

The part that seems to be the most problematic is the addition of a new vnode type and vnode operations in general. Doors semantics dictate that a door (similarly to sockets) should be accessed and used as a file descriptor.
So right off the bat there needs to be a way of binding the door file descriptor to a filesystem path name. At first glance it seems that this could be solved only with a new door-related 'struct fileops', but this is impossible due to the semantics of the door lifecycle and 'fattach'.
All door related data is released when the per-door reference counter drops to 0. However, in Illumos (and in the defunct Linux implementation [2]) this reference counter is updated each time the door file is opened and closed. 'struct fileops' does not offer the possibility of overriding the opening function which
in turn means that this interface cannot be used for doors as we need reference counting to maintain a proper data lifecycle. This is part of the reason why I added a new vnode type since I saw no other way of controlling a door's reference count and lifecycle.

But the biggest issue here (and the main reason for a new vnode type) are the 'fattach' and 'fdetach' operations. Again, door semantics dictate that after a door is created it must be manually attached to a file in order to be accessible to other processes, instead
of e.g. creating the file during the door creation syscall. The implementation of these operations is the only thing in this patch that I would refer to as ad-hoc since it exposes the door by replacing an existing file vnode's v_op vector to the door_vnops vector.
I have searched through Illumos sources and found that their implementation of there operations relies on 'namefs' [3]. The sheer complexity, scope and impact of implementing an equivalent of such a filesystem immediately discouraged me from considering this as a solution.
I am personally not satisfied with this approach and would gladly welcome any other viable solution.

Of course, in spite of a detailed research of kernel sources I may have missed an obvious and less complex solution to these problems due to my limited kernel experience and I gladly welcome any possible alternatives.

As for the userspace implementation, it seems theoretically doable but I think it would bring more overhead and other problems to be viable.
Due to the architectural and semantic reasons described above, I don't see a way of containing this code to a module either.

Thank you for pointing out fork and exec semantics and process stop/suspend requests, these have not been properly addressed and I will update the diff to address this issue.

The MD files should also be present now.

[1] - https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/doorfs/door_sys.c
[2] - http://www.rampant.org/doors/linux-doors.pdf
[3] - https://github.com/illumos/illumos-gate/tree/master/usr/src/uts/common/fs/namefs

sys/kern/kern_door.c
84	Thank you for pointing this out, td can only be curthread. I will update the code to use curthread instead of a td argument.
291	I have tried to elaborate on this issue in the comment. As for the reclamation of the previous vnode, I think that the situation would never arise but i need to think about it some more to be sure.
345	Thank you for the suggestion, I will look into it.
1619	Accidentally overlooked, this will be removed too.

Do you think fattach binding to a file in the file system namespace is a necessary characteristic that we need to replicate as well? If not, I think another less-compatible approach is to separate the namespace all together, just like the posixshm facilities.

include/door.h
12
27

In D34097#771160, @bojan.novkovic_fer.hr wrote:

I think that there may have been some misunderstandings regarding the architectural decisions in the patch and a general lack of a clearer picture, mostly due to the meager amount of information I provided and the sheer
size of the patch, so I will try to lay out and elaborate several key moments.

First off, the architecture itself is by no means ad-hoc or created on a whim and is instead a result of a detailed study of the Doors IPC mechanism. The architecture is very similar to the one in Illumos [1] (which was used as a point of reference throughout the whole patch, where applicable)
and is very self-contained (if we ignore the vnode-related stuff, the patch adds two variables to the proc and thread structures, modifies fork and exit to ensure door initialization and release and the rest of the mechanism resides entirely within kern_door.c).
It was designed to be an Illumos-compatible cleanroom implementation, not a carbon copy of the Illumos implementation.

By architecture I mean the way this thing integrates with the rest of the kernel. Starting with the decision to have a new vnode type, continuing to the behavior on fork/exec, (mis-)use of the VM KPI, locking and so on. I consider the API specification as given, but only for the external perimeter. What and how should be implemented, in kernel or in userspace, is all about the implementation architecture.

The part that seems to be the most problematic is the addition of a new vnode type and vnode operations in general.

I suspect that you have this impression only because I spent a little time looking at the VFS bits, but not on other parts.

Doors semantics dictate that a door (similarly to sockets) should be accessed and used as a file descriptor.
So right off the bat there needs to be a way of binding the door file descriptor to a filesystem path name. At first glance it seems that this could be solved only with a new door-related 'struct fileops', but this is impossible due to the semantics of the door lifecycle and 'fattach'.
All door related data is released when the per-door reference counter drops to 0. However, in Illumos (and in the defunct Linux implementation [2]) this reference counter is updated each time the door file is opened and closed. 'struct fileops' does not offer the possibility of overriding the opening function which
in turn means that this interface cannot be used for doors as we need reference counting to maintain a proper data lifecycle. This is part of the reason why I added a new vnode type since I saw no other way of controlling a door's reference count and lifecycle.

But the biggest issue here (and the main reason for a new vnode type) are the 'fattach' and 'fdetach' operations. Again, door semantics dictate that after a door is created it must be manually attached to a file in order to be accessible to other processes, instead
of e.g. creating the file during the door creation syscall. The implementation of these operations is the only thing in this patch that I would refer to as ad-hoc since it exposes the door by replacing an existing file vnode's v_op vector to the door_vnops vector.
I have searched through Illumos sources and found that their implementation of there operations relies on 'namefs' [3]. The sheer complexity, scope and impact of implementing an equivalent of such a filesystem immediately discouraged me from considering this as a solution.
I am personally not satisfied with this approach and would gladly welcome any other viable solution.

Basically, they implemented specific file system and mount it over the vnode which has door attached. This is what I mentioned earlier as the 'mount over the single vnode'.
An standalone ability to mount one vnode over another might be useful as is, for instance for implementation of namespaces.

The fact that your current approach cannot be modularized is a good indicator that it is not the right one.
What you do is architecturally flawed. It cannot work, really. It ignores the vnode lifecycle.

I seriously suggest you to rethink this from the ground, deciding how to split the implementation into the minimal invasive kernel part, and supporting userspace library.
This is without repeating that we need some justification for even considering taking this into sources without consumers.

Of course, in spite of a detailed research of kernel sources I may have missed an obvious and less complex solution to these problems due to my limited kernel experience and I gladly welcome any possible alternatives.

As for the userspace implementation, it seems theoretically doable but I think it would bring more overhead and other problems to be viable.
Due to the architectural and semantic reasons described above, I don't see a way of containing this code to a module either.

Thank you for pointing out fork and exec semantics and process stop/suspend requests, these have not been properly addressed and I will update the diff to address this issue.

The MD files should also be present now.

[1] - https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/doorfs/door_sys.c
[2] - http://www.rampant.org/doors/linux-doors.pdf
[3] - https://github.com/illumos/illumos-gate/tree/master/usr/src/uts/common/fs/namefs

sys/kern/kern_door.c
291	As for the reclamation of the previous vnode, I think that the situation would never arise but i need to think about it some more to be sure. Why? I am really curious.

In D34097#771212, @khng wrote:

Do you think fattach binding to a file in the file system namespace is a necessary characteristic that we need to replicate as well? If not, I think another less-compatible approach is to separate the namespace all together, just like the posixshm facilities.

I don't think that it is a necessary part for the IPC mechanism itself and that it is mostly needed for compatibility.
Separating the namespace sounds like a good idea, and I would be more than glad to update the differential using this approach if you think it is the way to go.
This would even enable modularizing the current approach as Konstantin suggested.

In D34097#771358, @bojan.novkovic_fer.hr wrote:

In D34097#771212, @khng wrote:

Do you think fattach binding to a file in the file system namespace is a necessary characteristic that we need to replicate as well? If not, I think another less-compatible approach is to separate the namespace all together, just like the posixshm facilities.

I don't think that it is a necessary part for the IPC mechanism itself and that it is mostly needed for compatibility.
Separating the namespace sounds like a good idea, and I would be more than glad to update the differential using this approach if you think it is the way to go.

It sounds like a better idea to me to not interact with file system namespace, thus there will be no hassle dealing with file system events like mount/unmount. I am not sure if others oppose this approach though.

This would even enable modularizing the current approach as Konstantin suggested.

brooks added inline comments.Jan 31 2022, 9:22 PM

sys/kern/syscalls.master
3321	The memory footprint of new system calls should be describable using types and the SAL annotations documented at this top of this file unless there is an extraordinarily API compatibility need. Please add real system calls if possible once other issues have been sorted.

gbe added a subscriber: gbe.Feb 2 2022, 9:26 AM

After reading all your comments and great suggestions, I will start incorporating them into the differential. As per Konstantin's suggestion, I will move everything to a separate module, ditch filesystem interactions and separate the namespace in a posixshm-like manner as Ka Ho Ng suggested, move thread pool logic to userspace and generally offload as much functionality as possible to userspace.
Thank you for your feedback and advice!

In D34097#773229, @bojan.novkovic_fer.hr wrote:

After reading all your comments and great suggestions, I will start incorporating them into the differential. As per Konstantin's suggestion, I will move everything to a separate module, ditch filesystem interactions and separate the namespace in a posixshm-like manner as Ka Ho Ng suggested, move thread pool logic to userspace and generally offload as much functionality as possible to userspace.
Thank you for your feedback and advice!

If there is no attachment to the vnode, do you need the kernel support for it, at all? In fact, the proper way is to implement everything possible in userspace (and I suspect that it can be done completely in userspace, without vnode attachement). Then, if there are performance issues, you could consider identifying and implementing helper kernel primitives to optimize the critical path.

But without vnodes, IMO it is just a glorified wrapper around condvar, and the first implementation should try just that.

lwhsu added a subscriber: emaste.Jun 6 2022, 1:33 PM

lwhsu added a subscriber: lwhsu.