- User Since
- Jul 28 2014, 7:43 PM (207 w, 5 d)
Jun 18 2018
Jun 17 2018
Jun 15 2018
(This would also allow us to more easily grep for kernel sizes and pointers exposed to userspace..)
Didn't quite catch this before it was committed. This isn't really a destructor, it's a close notification. Rather than confuse matters, as sockets have UMA destructors as well, this should probably be so_notify_close.
Rather than using raw integer types, I wonder if we should introduce new FreeBSD types -- e.g., ksize_t, kptr_t -- that wrap them, instead? They would still all map through to int64_t/uint64_t as required, but if we needed to change this in the future, we could do so more easily.
Jun 14 2018
Jun 13 2018
Jun 12 2018
I've not had a chance to read this patch in any detail as yet, as I've just been added as a reviewer. On the whole this seems a structurally safe change as long as exclusive acquisition of pcbinfo is still used to serialise multi-inpcb lock acquires, and particular care is taken around the listen/accept context, where there's TCP- and socket-stack recursion. I guess there is more generally a concern about the throughput of unreleased inpcbs in high connections/second turnover environments due to shifting to an epoch-based model for deferred frees -- e.g., high-volume TCP-based DNS service -- that it might be good to test in, if that's not already been done, as that's quite a different workload from long-lived and fairly fat CDN workloads.
Apr 26 2018
Mostly style and comment issues in this pass, although there are a few other queries -- e.g., relating to so and inh_so lock order.
Apr 23 2018
There is a surprising lack of locking assertions in the substantial new chunks of code proposed for in_pcb.c and in6_pcb.c. Could you add proper locking assertions so that we can better understand the locking protocol? Please also add a comment above new data-structure definitions to make clear what the locking requirements are. Please document the per-field locking requirements in in_pcb.h.
Feb 12 2018
Seems sensible. I've given this a quick but not detailed read looking for transcription errors.
Seems entirely sensible. I've given it a quick read and it seems fine, but not a detailed read for transcription errors or compilation problems.
Feb 3 2018
Looks good to me, and long overdue.
Looks good to me -- and long overdue.
Dec 6 2017
It would be nice to ditch the requirement for exclusive vnode lock acquisition in VFS for extended attribute writes. But I'm not sure the filesystems would approve of that..?
Dec 4 2017
Dec 1 2017
Overall, if the vn_extattr_*(9) KPIs are not sufficient to implement the required semantics, then we should consider ways to improve them to avoid the code duplication and errors found in this patch. I'm a bit puzzled by the need to query attributes to check immutable flags, etc. explicitly -- the filesystem should be performing these checks already. Unless something else is wrong..?
Nov 13 2017
I like this change, and have not yet spotted any problems with it. The one change I might make is to add a few assertions before the final return(inp) in each case to assert that the held lock matches the requested lock.
Oct 20 2017
Overall, correcting missing locking around non-atomic sequences of option changes -- e.g., the read-modify-write sequences relating to IPSEC -- seems an excellent idea. I'm not convinced, however, that additional locking is required around many of the single-field in-place updates (e.g., of the TTL) on x86 (with TSO), and from our conversation in the transport-protocol call last night, it sounds like you are not encountering problems with them. However, it may make sense to make the change regardless, in order to improve maintainability -- i.e., in case those structure fields see other lock-protected updates than are currently in the source tree, which could themselves suffer races. And it seems unlikely that any are in fast paths, nor will experience substantial contention. @jhb will be in Cambridge next week and I'll chat with him about the atomic write aspect (e.g., in particular, while the stores to fields such as the TTL will be atomic on all architectures, guarantees about ordering and progress will be weaker on some, such as POWER and ARMv8), to get his take.
Sep 22 2017
Many of these changes are not sufficient to address the underlying problems in this code, as the lock covers loop iteration but fails to protect the stability of the 'ifp' or 'ia' pointer outside of the loop. It could be that this prevents the crash you are seeing as the issue is iteration over entries being deleted from the list, or that the problem you are seeing is masked by changes in timing (as it involves using an ifp, ia, etc, being removed). It would be useful to think a little harder about our goals here. @bz and I have been pondering for some time a somewhat coarser, rmlock-based synchronisation approach to use in these circumstances, but have not prototyped this yet.
This seems generally sensible.
Sep 21 2017
Seems good to me, although I notice that other object types (e.g., sockets, ..) don't have xrefs from the object-type list. Probably they should, but maybe as a separate commit?
May 22 2017
Sounds plausible, but I do wonder if the sysctl is currently a sufficient mature way to enable application development. Enabling it requires root, so it's not directly usable by end users on multiuser systems, and it also has global scope rather than just affecting applications that the developer is working on, which could change failure modes for a range of applications (such as desktop applications) that the developer has no interest in debugging and fixing. Is there some other mechanism we can add -- e.g., using ptrace(2) -- or setting an environmental variable that causes rtld to itself twiddle a per-process setting, that might offer a better real-world debugging experience?
May 18 2017
May 13 2017
One aspect I've been struggling with in this approach is duplication of the logic to find run-time linkers -- i.e., shifting responsibility for ELF header parsing from the kernel to userspace, which seems generally undesirable. One possibility might be to pass a capability to a directory relative to which the kernel should look for the interpreter. This would fail to address the "use a run-time linker other than the one in the binary" use case, but would allow the kernel to continue to own ELF header processing (and similar for non-ELF binaries).
May 1 2017
FYI, there is another important semantic difference between BSD and Linux extended attributes. In FreeBSD, ACLs are exposed (and manipulated) via separate vnode operations in VFS, and similarly ACL system calls, since our VFS is ACL-aware, whereas in Linux, they use the extended attribute system calls and inode operations to carry a variety of metadata including ACLs. As such, if extfs is implementing ACLs, we'll want a wrapper that maps them to/from FreeBSD ACL vnode operations on the way through. There is arguably a desire to do something similar in the linuxulator to ensure that ACL operations enter our VFS as ACL operations rather than EA operations.
Apr 22 2017
The expert on the UFS2 extattr code is phk, who wrote it. I believe UFS2 generally relies on the buffer cache to cache the extattr block associated with an inode, so that it follows normal LRU-like eviction rules, etc. I believe that the only time UFS2 keeps extattr data hung off the inode in a special in-memory buffer is during a multi-operation transaction started by VOP_OPENEXTATTR and a corresponding later VOP_CLOSEXTATTR. Between those two VOPs, if a buffer is present, writes occur against the buffer rather than against the buffer cache, allowing the writes to be batched atomically. Otherwise, I believe that UFS2 will simply issue updates to bits of buffer-cache-resident extattr data. Take a look at ffs_extread and ffs_extwrite for details.
FYI, the ufs_extattr.c implementation is for UFS1 only, where there isn't in-layout storage for attributes. UFS2 uses code in ffs_vnops.c, relying on an additional block hung off the inode, and is probably a better reference for this work. The UFS implementation provides transaction-like semantics for multi-EA update -- hence the open/close behaviour. That is desirable when working with multiple simultaneous MAC policies that are each adding metadata in different attributes. I'm not sure how useful that will be to FreeBSD ext2fs users in practice, but it is the semantics in our VFS as a result -- and is what implies the in-memory copy so that we can atomically commit all the updates. I wonder if the 'default' mapping for ext2_extattr_index_to_linux() is safe..? I'm not sure what namespaces exist in Linux these days, but it might be one prefers to be conservative and protect non-user namespaces from access by unprivileged users.
Apr 18 2017
FYI, it may be desirable to add a note about scoping of cpuset*(2) to the capsicum(4) man page. We should probably extend that man page in other ways to describe other sorts of scoping in place, but that's a separate task...
Apr 13 2017
Apr 11 2017
I'm not sure I approve of calling a local variable 'fakeerror'. Given that the only value it can take on is ENOTCONN, how about making it a boolean 'soerror_enotconn'?
Note that this is not the complete story: There's a separate issue with "interrupting" threads already blocked in I/O on sockets at shutdown(2) time. Lack of that support causes a test failure in the Java test suite (if I recall) because calling shutdown(2) on a socket from one thread while another thread is blocked in read(2)/recv(2) or write(2)/send(2) will not interrupt the blocked thread. This is due to the way we do locking and reference counting on file descriptors and sockets.
Apr 4 2017
I like the overall approach especially after various changes to do the checks only in the system calls themselves, not in the common helper functions.
Apr 3 2017
Mar 31 2017
Two whitespace fixes requested by @emaste.
Catch a couple of further instances of K&R prototypes not caught by current
Line wrap two overlong lines (with new type information) to 80 characters.
Mar 30 2017
Mar 29 2017
We are probably now at a commit candidate, if various reviewers wouldn't mind checking that they are happy with how the patch has ended up looking?
Restore higher stability level for DTrace probes, as otherwise the
DTrace command-line tool will reject use of the probes in its default
Thanks for these reviews; will make various changes and update the patch soon.
Mar 28 2017
Further updates to constant use in less(1) using a more recent LLVM.
Mar 27 2017
Minor update: remove unneeded #include that snuck in.
Mar 26 2017
FYI, I have now committed a man page for DPCPU(9) in r316003. It includes some (safe?) synchronisation patterns in its example code.
This is correct: you must make sure that you continue to access state on the CPU for which you acquired a mutex -- e.g., by caching a pointer to the per-CPU state you are accessing, in case migration takes place.
But that is racey. Preemption can in theory occur straight after I have verified that it hasn't. Looks like I need to use critical regions for now. I can live with that if you can?
Mar 23 2017
Just a few quick comments:
Mar 15 2017
How does this play out with non-native ABIs (e.g., the Linux emulator) -- I thought SYS_fork (etc) were ABI-specific system-call numbers?
Mar 13 2017
It would be more tempting to add the systrace_probe_func invocation at the end of fork_return() where the similar KTRACE probe fires (for similar reasons). Take a look at the call to ktrsysret(SYS_fork, 0, 0); for details.
Mar 4 2017
This seems like a sensible general change. I'm quite surprised it wasn't this way already (.. and sort of misremembered that it was -- IPSEC should always have been using the netisr...).
Feb 7 2017
I like the idea, and encourage you to proceed, but be aware that struct tcpcb is part of the user-visible ABI for monitoring tools (sigh). Someone should restock our supplies of padding someday.
Jan 28 2017
Jan 9 2017
Adding "-R" support is a good idea.
Jan 6 2017
I'm not sure if consumers of m_pulldown() make assumptions about writability or not. The man page doesn't mention that they should (or not) but this is more of an empirical question. As I recall, m_pulldown() is particularly popular in IPv6, so tagging Bjoern to perhaps take a look at this and see what he thinks.