- User Since
- Jul 28 2014, 7:43 PM (173 w, 4 d)
Mon, Nov 13
I like this change, and have not yet spotted any problems with it. The one change I might make is to add a few assertions before the final return(inp) in each case to assert that the held lock matches the requested lock.
Oct 20 2017
Overall, correcting missing locking around non-atomic sequences of option changes -- e.g., the read-modify-write sequences relating to IPSEC -- seems an excellent idea. I'm not convinced, however, that additional locking is required around many of the single-field in-place updates (e.g., of the TTL) on x86 (with TSO), and from our conversation in the transport-protocol call last night, it sounds like you are not encountering problems with them. However, it may make sense to make the change regardless, in order to improve maintainability -- i.e., in case those structure fields see other lock-protected updates than are currently in the source tree, which could themselves suffer races. And it seems unlikely that any are in fast paths, nor will experience substantial contention. @jhb will be in Cambridge next week and I'll chat with him about the atomic write aspect (e.g., in particular, while the stores to fields such as the TTL will be atomic on all architectures, guarantees about ordering and progress will be weaker on some, such as POWER and ARMv8), to get his take.
Sep 22 2017
Many of these changes are not sufficient to address the underlying problems in this code, as the lock covers loop iteration but fails to protect the stability of the 'ifp' or 'ia' pointer outside of the loop. It could be that this prevents the crash you are seeing as the issue is iteration over entries being deleted from the list, or that the problem you are seeing is masked by changes in timing (as it involves using an ifp, ia, etc, being removed). It would be useful to think a little harder about our goals here. @bz and I have been pondering for some time a somewhat coarser, rmlock-based synchronisation approach to use in these circumstances, but have not prototyped this yet.
This seems generally sensible.
Sep 21 2017
Seems good to me, although I notice that other object types (e.g., sockets, ..) don't have xrefs from the object-type list. Probably they should, but maybe as a separate commit?
May 22 2017
Sounds plausible, but I do wonder if the sysctl is currently a sufficient mature way to enable application development. Enabling it requires root, so it's not directly usable by end users on multiuser systems, and it also has global scope rather than just affecting applications that the developer is working on, which could change failure modes for a range of applications (such as desktop applications) that the developer has no interest in debugging and fixing. Is there some other mechanism we can add -- e.g., using ptrace(2) -- or setting an environmental variable that causes rtld to itself twiddle a per-process setting, that might offer a better real-world debugging experience?
May 18 2017
May 13 2017
One aspect I've been struggling with in this approach is duplication of the logic to find run-time linkers -- i.e., shifting responsibility for ELF header parsing from the kernel to userspace, which seems generally undesirable. One possibility might be to pass a capability to a directory relative to which the kernel should look for the interpreter. This would fail to address the "use a run-time linker other than the one in the binary" use case, but would allow the kernel to continue to own ELF header processing (and similar for non-ELF binaries).
May 1 2017
FYI, there is another important semantic difference between BSD and Linux extended attributes. In FreeBSD, ACLs are exposed (and manipulated) via separate vnode operations in VFS, and similarly ACL system calls, since our VFS is ACL-aware, whereas in Linux, they use the extended attribute system calls and inode operations to carry a variety of metadata including ACLs. As such, if extfs is implementing ACLs, we'll want a wrapper that maps them to/from FreeBSD ACL vnode operations on the way through. There is arguably a desire to do something similar in the linuxulator to ensure that ACL operations enter our VFS as ACL operations rather than EA operations.
Apr 22 2017
The expert on the UFS2 extattr code is phk, who wrote it. I believe UFS2 generally relies on the buffer cache to cache the extattr block associated with an inode, so that it follows normal LRU-like eviction rules, etc. I believe that the only time UFS2 keeps extattr data hung off the inode in a special in-memory buffer is during a multi-operation transaction started by VOP_OPENEXTATTR and a corresponding later VOP_CLOSEXTATTR. Between those two VOPs, if a buffer is present, writes occur against the buffer rather than against the buffer cache, allowing the writes to be batched atomically. Otherwise, I believe that UFS2 will simply issue updates to bits of buffer-cache-resident extattr data. Take a look at ffs_extread and ffs_extwrite for details.
FYI, the ufs_extattr.c implementation is for UFS1 only, where there isn't in-layout storage for attributes. UFS2 uses code in ffs_vnops.c, relying on an additional block hung off the inode, and is probably a better reference for this work. The UFS implementation provides transaction-like semantics for multi-EA update -- hence the open/close behaviour. That is desirable when working with multiple simultaneous MAC policies that are each adding metadata in different attributes. I'm not sure how useful that will be to FreeBSD ext2fs users in practice, but it is the semantics in our VFS as a result -- and is what implies the in-memory copy so that we can atomically commit all the updates. I wonder if the 'default' mapping for ext2_extattr_index_to_linux() is safe..? I'm not sure what namespaces exist in Linux these days, but it might be one prefers to be conservative and protect non-user namespaces from access by unprivileged users.
Apr 18 2017
FYI, it may be desirable to add a note about scoping of cpuset*(2) to the capsicum(4) man page. We should probably extend that man page in other ways to describe other sorts of scoping in place, but that's a separate task...
Apr 13 2017
Apr 11 2017
I'm not sure I approve of calling a local variable 'fakeerror'. Given that the only value it can take on is ENOTCONN, how about making it a boolean 'soerror_enotconn'?
Note that this is not the complete story: There's a separate issue with "interrupting" threads already blocked in I/O on sockets at shutdown(2) time. Lack of that support causes a test failure in the Java test suite (if I recall) because calling shutdown(2) on a socket from one thread while another thread is blocked in read(2)/recv(2) or write(2)/send(2) will not interrupt the blocked thread. This is due to the way we do locking and reference counting on file descriptors and sockets.
Apr 4 2017
I like the overall approach especially after various changes to do the checks only in the system calls themselves, not in the common helper functions.
Apr 3 2017
Mar 31 2017
Two whitespace fixes requested by @emaste.
Catch a couple of further instances of K&R prototypes not caught by current
Line wrap two overlong lines (with new type information) to 80 characters.
Mar 30 2017
Mar 29 2017
We are probably now at a commit candidate, if various reviewers wouldn't mind checking that they are happy with how the patch has ended up looking?
Restore higher stability level for DTrace probes, as otherwise the
DTrace command-line tool will reject use of the probes in its default
Thanks for these reviews; will make various changes and update the patch soon.
Mar 28 2017
Further updates to constant use in less(1) using a more recent LLVM.
Mar 27 2017
Minor update: remove unneeded #include that snuck in.
Mar 26 2017
FYI, I have now committed a man page for DPCPU(9) in r316003. It includes some (safe?) synchronisation patterns in its example code.
This is correct: you must make sure that you continue to access state on the CPU for which you acquired a mutex -- e.g., by caching a pointer to the per-CPU state you are accessing, in case migration takes place.
But that is racey. Preemption can in theory occur straight after I have verified that it hasn't. Looks like I need to use critical regions for now. I can live with that if you can?
Mar 23 2017
Just a few quick comments:
Mar 15 2017
How does this play out with non-native ABIs (e.g., the Linux emulator) -- I thought SYS_fork (etc) were ABI-specific system-call numbers?
Mar 13 2017
It would be more tempting to add the systrace_probe_func invocation at the end of fork_return() where the similar KTRACE probe fires (for similar reasons). Take a look at the call to ktrsysret(SYS_fork, 0, 0); for details.
Mar 4 2017
This seems like a sensible general change. I'm quite surprised it wasn't this way already (.. and sort of misremembered that it was -- IPSEC should always have been using the netisr...).
Feb 7 2017
I like the idea, and encourage you to proceed, but be aware that struct tcpcb is part of the user-visible ABI for monitoring tools (sigh). Someone should restock our supplies of padding someday.
Jan 28 2017
Jan 9 2017
Adding "-R" support is a good idea.
Jan 6 2017
I'm not sure if consumers of m_pulldown() make assumptions about writability or not. The man page doesn't mention that they should (or not) but this is more of an empirical question. As I recall, m_pulldown() is particularly popular in IPv6, so tagging Bjoern to perhaps take a look at this and see what he thinks.
Dec 7 2016
Nov 30 2016
Nov 22 2016
Nov 5 2016
Oct 20 2016
Mentor approval granted. (NB: not a technical review, but existing technical reviews here look good to go!)
Oct 7 2016
Oct 5 2016
Overall I like this approach, but there's an important experimental question as to whether this enables all the use cases we care about -- and, more generally, whether there are visible failure modes that might surprise application programmers. We also need to think quite hard to convince ourselves this maintains safe operation. Getting Jon Anderson, Ben Laurie, and David Drysdale to review the approach would be very useful.
Oct 3 2016
Oct 1 2016
Sep 30 2016
In general, this seems like a good idea. A bit of wordsmithing does help, and reviewing an updated commit candidate before it goes into the tree wouldn't hurt if you can tolerate another RTT with reviewers :-).
Sep 23 2016
I'm fine with exposing the hostname here -- the goal of Capsicum has always been to be pragmatic about getting software running where it doesn't violate isolation properties. You could argue that this is an information leak and/or might cause problems for deterministic replay-style applications of Capsicum -- but I'd rather we had more code working in a sandboxing. :-)
Sep 18 2016
High-level comments rather than a detailed code review:
Sep 13 2016
Should we also be ditching M_COPY() and/or switching it to M_COPYM() for consistency..?
Aug 29 2016
The comments could have to do with the au_qctrl structure, which uses "int", whereas the au_qctrl64 type uses "uint64_t". You can see the code handling the older structure a bit below this point, which likely has to do with compatibility with older Solaris/XNU versions rather than FreeBSD per se.
Aug 20 2016
Jul 28 2016
Looks good to me!
Jul 15 2016
I'd suggest avoiding any style changes in the initial copy of code to the new locations, so diffs can more easily be checked, and changes can be more easily merged. Apply style/comment/etc changes in a separate commit.
Jul 11 2016
I think using panic() here would be preferable to KASSERT().