In D45042#1029022, @alc wrote: In D45042#1028354, @gallatin wrote: In D45042#1028058, @alc wrote: In D45042#1027957, @markj wrote: Do we have any idea what the downsides of the change are? If we make the default 64KB, then I'd expect memory usage to increase; do we have any idea what that looks like? It'd be nice to, for example, compare memory usage on a newly booted system with and without this change.
I had the same question. It will clearly impact a lot of page granularity counters, at the very least causing some confusion for people who look at those counters, e.g.,
./include/jemalloc/internal/arena_inlines_b.h-    arena_stats_add_u64(tsdn, &arena->stats,
./include/jemalloc/internal/arena_inlines_b.h-        &arena->decay_dirty.stats->nmadvise, 1);
./include/jemalloc/internal/arena_inlines_b.h-    arena_stats_add_u64(tsdn, &arena->stats,
./include/jemalloc/internal/arena_inlines_b.h:        &arena->decay_dirty.stats->purged, extent_size >> LG_PAGE);
./include/jemalloc/internal/arena_inlines_b.h-    arena_stats_sub_zu(tsdn, &arena->stats, &arena->stats.mapped,
./include/jemalloc/internal/arena_inlines_b.h-        extent_size);

However, it's not so obvious what the effect on the memory footprint will be. For example, the madvise(MADV_FREE) calls will have coarser granularity. If we set the page size to 64KB, then one in-use 4KB page within a 64KB region will be enough to block the application of madvise(MADV_FREE) to the other 15 pages. Quantifying the impact that this coarsening has will be hard.
This does, however, seem to be the intended workaround: https://github.com/jemalloc/jemalloc/issues/467
Buried in that issue is the claim that Firefox's builtin derivative version of jemalloc eliminated the statically compiled page size.
What direction does the kernel grow the vm map? They apparently reverted support for lg page size values larger than the runtime page size because it caused fragmentation when the kernel grows the vm map downwards.
Typically, existing map entries are only extended in an upward direction. For downward growing regions, e.g., stacks, new entries are created. Do you have a pointer to where this is discussed? I'm puzzled as to why the direction would be a factor.
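To make the MADV_FREE coarsening discussed above concrete, here is a minimal sketch (not actual jemalloc code) of purging with a 64KB allocator page on 4KB hardware pages; the live[] bitmap, the function, and the constants are all assumptions for illustration, but they show why a single in-use 4KB page keeps the other 15 from being handed to madvise(MADV_FREE):

#include <sys/mman.h>
#include <stdbool.h>
#include <stddef.h>

#define JEMALLOC_PAGE   (64 * 1024)     /* assumed LG_PAGE = 16 */
#define SYS_PAGE        (4 * 1024)      /* hardware page size */
#define SUBPAGES        (JEMALLOC_PAGE / SYS_PAGE)

/*
 * Purge the 64KB units of [base, base + len) that contain no live 4KB
 * pages; "live" has one entry per 4KB page.  One live 4KB page keeps
 * the remaining 15 resident.
 */
static void
purge_region(char *base, size_t len, const bool *live)
{
        for (size_t off = 0; off < len; off += JEMALLOC_PAGE) {
                bool in_use = false;
                for (size_t i = 0; i < SUBPAGES; i++) {
                        if (live[off / SYS_PAGE + i]) {
                                in_use = true;
                                break;
                        }
                }
                if (!in_use)
                        (void)madvise(base + off, JEMALLOC_PAGE, MADV_FREE);
        }
}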
Wed, May 8
Sun, May 5
In D45042#1028058, @alc wrote: In D45042#1027957, @markj wrote: Do we have any idea what the downsides of the change are? If we make the default 64KB, then I'd expect memory usage to increase; do we have any idea what that looks like? It'd be nice to, for example, compare memory usage on a newly booted system with and without this change.
I had the same question. It will clearly impact a lot of page granularity counters, at the very least causing some confusion for people who look at those counters, e.g.,
./include/jemalloc/internal/arena_inlines_b.h-    arena_stats_add_u64(tsdn, &arena->stats,
./include/jemalloc/internal/arena_inlines_b.h-        &arena->decay_dirty.stats->nmadvise, 1);
./include/jemalloc/internal/arena_inlines_b.h-    arena_stats_add_u64(tsdn, &arena->stats,
./include/jemalloc/internal/arena_inlines_b.h:        &arena->decay_dirty.stats->purged, extent_size >> LG_PAGE);
./include/jemalloc/internal/arena_inlines_b.h-    arena_stats_sub_zu(tsdn, &arena->stats, &arena->stats.mapped,
./include/jemalloc/internal/arena_inlines_b.h-        extent_size);

However, it's not so obvious what the effect on the memory footprint will be. For example, the madvise(MADV_FREE) calls will have coarser granularity. If we set the page size to 64KB, then one in-use 4KB page within a 64KB region will be enough to block the application of madvise(MADV_FREE) to the other 15 pages. Quantifying the impact that this coarsening has will be hard.
This does, however, seem to be the intended workaround: https://github.com/jemalloc/jemalloc/issues/467
Buried in that issue is the claim that Firefox's builtin derivative version of jemalloc eliminated the statically compiled page size.
Thu, May 2
In D45042#1027404, @andrew wrote:I've been thinking about adding PAGE_SIZE_MAX/PAGE_SHIFT_MAX or similar to arm64 to define the largest page size the kernel could support. We could then use that here if it's defined.
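A rough sketch of what that arrangement could look like; the macro names follow @andrew's comment, while the values and the jemalloc-side fallback are assumptions rather than committed code:

/* Hypothetical arm64 additions (value assumed: 64KB granules). */
#define PAGE_SHIFT_MAX  16
#define PAGE_SIZE_MAX   (1 << PAGE_SHIFT_MAX)

/* jemalloc's page-size configuration could then prefer the compile-time maximum: */
#if defined(PAGE_SHIFT_MAX)
#define LG_PAGE         PAGE_SHIFT_MAX
#else
#define LG_PAGE         12              /* assume 4KB pages otherwise */
#endif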
Wed, May 1
In D40676#1027002, @markj wrote: In D40676#1027000, @gallatin wrote: After this change, ktrace output is littered with 'CAP system call not allowed: $SYSCALL' on systems w/o capsicum enabled, which is confusing and distracting. Can this please be reverted to behave without CAP output for systems w/o capsicum ?
This was done already in commit f239db4800ee9e7ff8485f96b7a68e6c38178c3b.
After this change, ktrace output is littered with 'CAP system call not allowed: $SYSCALL' on systems w/o capsicum enabled, which is confusing and distracting. Can this please be reverted to behave without CAP output for systems w/o capsicum ?
Tue, Apr 30
Mon, Apr 29
I consulted with @imp, and after a trip down the rabbit hole, we concluded that a header file consisting only of the definition of MAXPHYS is not creative (as this is the only way to express this in C) so it can't have copyright protection, and should simply be public domain.
Sun, Apr 28
- Update diff to avoid cutting/pasting MAXPHYS definition as per @kib's suggestion
Thu, Apr 18
I just tripped over this again when trying to use some of the 16K changes I have in my Netflix tree on a personal machine running a GENERIC kernel, so let's try this again in a different way.
Mon, Apr 15
Apr 5 2024
Thank you for adding that option.
Apr 3 2024
In D43504#1017265, @markj wrote: In D43504#1017242, @gallatin wrote: Below are the results from my testing. I'm sorry that it took so long. I had to re-do testing from the start b/c the new machine was not exactly identical to the old (different BIOS rev) and was giving slightly different results.
The results are from 92Gb/s of traffic over a one hour period with 45-47K TCP connections established/
No SDT probes: 56.4%
normal SDT 57.5%
new IP SDTs 57.9%
new IP SDT+ 56.6%
zero-cost

Just to be clear, "SDT+" is with the patch I supplied to provide new asm goto-based SDT probes? I'm not sure what the "zero-cost" line means.
This is just measuring CPU usage as reported by the scheduler?
I made some progress on the hot-patching implementation last week. I hope to have it ready fairly soon.
Below are the results from my testing. I'm sorry that it took so long. I had to re-do testing from the start b/c the new machine was not exactly identical to the old (different BIOS rev) and was giving slightly different results.
The results are from 92Gb/s of traffic over a one hour period with 45-47K TCP connections established/
No SDT probes: 56.4%
normal SDT 57.5%
new IP SDTs 57.9%
new IP SDT+ 56.6%
zero-cost
Mar 29 2024
In D43504#1014924, @gallatin wrote: OK, starting with an unpatched kernel & working my way through the patches. I'll report percent busy for unpatched and various patches on our original 100G server (based around the Xeon E5-2697A v4, which tends to be a poster-child for cache misses, as it runs very close to the limits of its memory bandwidth). I'll be disabling powerd and using TCP RACK's DGP pacing.
This will take several days, as it takes a while to load up a server, get a few hours of steady-state, and unload it gently.
Mar 26 2024
Mar 25 2024
OK, starting with an unpatched kernel & working my way through the patches. I'll report percent busy for unpatched and various patches on our original 100G server (based around the Xeon E5-2697A v4, which tends to be a poster-child for cache misses, as it runs very close to the limits of its memory bandwidth). I'll be disabling powerd and using TCP RACK's DGP pacing.
Mar 23 2024
In D43504#1014487, @markj wrote:Regarding SDT hotpatching, the implementation[1] was written a long time ago, before we had "asm goto" in LLVM. It required a custom toolchain program[2].
Since then, "asm goto" support appeared in LLVM. It makes for a much simpler implementation. I hacked up part of it and posted a patch[3]. In particular, the patch makes use of asm goto to remove the branch and data access. (The probe site is moved to the end of the function in an unreachable block.) The actual hot-patching part isn't implemented and will take some more work, but this is enough to do some benchmarking to verify that the overhead really is minimal. @gallatin would you be able to verify this?
I would also appreciate any comments on the approach taken in the patch, keeping in mind that the MD bits are not yet implemented.
[1] https://people.freebsd.org/~markj/patches/sdt-zerocost/
[2] https://github.com/markjdb/sdtpatch
[3] https://reviews.freebsd.org/D44483
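To illustrate the approach described above at a very high level, here is a sketch (not the actual D44483 patch; the macro and function names are made up): the probe body sits in a normally unreachable block at the end of the function, and the fall-through path is a single nop that would be hot-patched into a jump when the probe is enabled, so the fast path carries no flag test or data access.

#include <stdio.h>

/* Sketch of an asm goto probe site; the "nop" would be hot-patched to a jump. */
#define SDT_PROBE_SITE(lbl)     __asm__ goto ("nop" : : : : lbl)

void
handle_packet(int len)
{
        SDT_PROBE_SITE(probe);
out:
        return;                                 /* fast path: straight-line code */
probe:
        printf("probe fired: len=%d\n", len);   /* stand-in probe body */
        goto out;
}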
Mar 21 2024
In D43504#1013980, @tuexen wrote:I agree, that introducing these probes should not have a performance hit. Since they exist in Solaris, I was assuming that there is no substantial performance hit. I would really like to see zero cost probes going into the tree.
In D43504#1013965, @gallatin wrote:Guys, this is crazy. Every SDT probe does a test on a global variable. If this lands, it will cause a noticeable performance impact, especially in high packet rate workloads. Can we shelve this until / unless SDT is modified to insert nops rather than do tests on a global variable? Or put this under its own options EXTRA_IP_PROBES or something?
Guys, this is crazy. Every SDT probe does a test on a global variable. If this lands, it will cause a noticeable performance impact, especially in high packet rate workloads. Can we shelve this until / unless SDT is modified to insert nops rather than do tests on a global variable? Or put this under its own options EXTRA_IP_PROBES or something?
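For context, a simplified sketch of the per-probe enabled check being criticized here (the real macros in sys/sdt.h differ in detail; these names are stand-ins): even when the probe is disabled, every call site pays for a load of the enable word and a conditional branch.

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the SDT machinery; not the real sys/sdt.h. */
struct sdt_probe {
        volatile uint32_t id;   /* nonzero once a consumer enables the probe */
};

static struct sdt_probe ip_receive_probe;

#define SDT_PROBE1(probe, arg0)                                 \
        do {                                                    \
                if (__builtin_expect((probe)->id != 0, 0))      \
                        printf("probe %u: %ld\n", (probe)->id,  \
                            (long)(arg0));                      \
        } while (0)

void
ip_input_demo(long pktlen)
{
        /* Disabled case still executes the load and branch above. */
        SDT_PROBE1(&ip_receive_probe, pktlen);
}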
Mar 20 2024
In D44204#1011171, @zlei wrote:Generally looks good to me.
This tests well. On my test system (hw.model: Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz), I see a 50% reduction in syscall latency for the lmbench lat_syscall test (0.89us -> 0.44us) when the system is idle.
Mar 19 2024
Do you think you could also add a 'bool __read_mostly hpts_userret_hook = true;' controlled by a sysctl to avoid calling into hpts at all from userret, for folks that want to disable this entirely?
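A sketch of the knob being requested, in kernel C; the sysctl parent node, the hook function, and its placement are guesses for illustration, not the actual tcp_hpts code:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

SYSCTL_DECL(_net_inet_tcp_hpts);        /* assumed parent: net.inet.tcp.hpts */

static bool hpts_userret_hook __read_mostly = true;
SYSCTL_BOOL(_net_inet_tcp_hpts, OID_AUTO, userret_hook, CTLFLAG_RWTUN,
    &hpts_userret_hook, 0, "Run HPTS processing from userret()");

/* Hypothetical hook called from userret(). */
void
tcp_hpts_userret(void)
{
        if (!hpts_userret_hook)
                return;
        /* ... drive deferred pacing work here ... */
}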
Mar 5 2024
In D44204#1008904, @melifaro wrote:LGTM, q - would it be possible to introduce ‘ip6po_<set|clear>_<field>’ inline functions and use them so we don’t accidentally miss setting/clearing up the relevant bit?
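Helpers along the lines @melifaro suggests might look like the following sketch, intended to live alongside struct ip6_pktopts in netinet6/ip6_var.h; IP6PO_VALID_NHINFO and ip6po_nexthop follow names already in the review, while the ip6po_valid field name is an assumption:

/* Hedged sketch of paired set/clear helpers; field names partly assumed. */
static inline void
ip6po_set_nhinfo(struct ip6_pktopts *opts, struct sockaddr *nexthop)
{
        opts->ip6po_nexthop = nexthop;
        opts->ip6po_valid |= IP6PO_VALID_NHINFO;
}

static inline void
ip6po_clear_nhinfo(struct ip6_pktopts *opts)
{
        opts->ip6po_nexthop = NULL;
        opts->ip6po_valid &= ~IP6PO_VALID_NHINFO;
}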
In D44204#1008766, @ae wrote:Probably you can simplify some similar checks in in6_src.c too, e.g. IP6PO_VALID_PKTINFO and IP6PO_VALID_NHINFO. Not sure how it impacts your cache misses measurements.
Added a comment explaining why the flags exist, as suggested by @glebius
In D44204#1008527, @bz wrote:Initially I thought we should name some better but the original structs have the same names so all good.
I have not checked if you got all the places but it looks good.
- Added a blank line before ip6po_m, as requested by @bz
Mar 4 2024
Feb 21 2024
Feb 9 2024
Jan 23 2024
In D43504#991958, @kp wrote:
Jan 12 2024
In D43400#989566, @jhb wrote:For future reference, uploading diffs with context (e.g. with git-arc) makes reviewing easier. I don't think we need the vnet around if_rele() (in case it calls if_free), so I think this is correct.
Jan 11 2024
Jan 10 2024
In D43385#989059, @mav wrote:There is already a panic in apei_ge_handler(), based on total status severity. Do you see it not enough?
Jan 9 2024
Removed hunk that was Netflix specific
Jan 3 2024
I pinged Nvidia/Mellanox last week, and I'm still waiting to hear back to see if they can support AccECN in their NICs.
Dec 26 2023
OK, I'm sorry, I was not aware of AccECN and its desired behavior of setting CWR on all segments.
Properly handling CWR is part of the NDIS spec... though the spec is broken, and says that "If the CWR bit in the TCP header of the large TCP packet is set, the miniport driver must set this bit in the TCP header of the first packet that it creates from the large TCP packet. The miniport driver may choose to set this bit in the TCP header of the last packet that it creates from the large TCP packet, although this is less desirable." [https://learn.microsoft.com/en-us/windows-hardware/drivers/network/offloading-the-segmentation-of-large-tcp-packets]
I'd prefer you add a feature flag so that NICs which do properly support CWR be able to use TSO, and avoid being pessimized by this.
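A sketch of the kind of gating being asked for; IFCAP2_TSO_CWR is a made-up capability bit and the helper is hypothetical, with the driver's second capability word passed in however the stack exposes it:

#include <sys/types.h>
#include <stdbool.h>
#include <stdint.h>
#include <netinet/tcp.h>

#define IFCAP2_TSO_CWR  0x00000001u     /* hypothetical: NIC handles CWR on TSO */

/* Only let a CWR-marked segment through TSO when the NIC advertises support. */
static bool
tso_ok_for_cwr(const struct tcphdr *th, uint32_t cap2)
{
        if ((th->th_flags & TH_CWR) == 0)
                return (true);
        return ((cap2 & IFCAP2_TSO_CWR) != 0);
}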
Dec 11 2023
In D42988#980253, @glebius wrote:The if_afdata[] array comes from the old BSD times when it was expected that there would be support for many many address families (e.g. IPX, AppleTalk, etc). Right now it has only two entries AF_INET and AF_INET6. It is very very very unlikely it will ever get a third one. It is much more likely that the array will go away and we will have just ifp->if_inet and ifp->if_inet6. Or maybe something more complicated. Anyway, the access to this data is going to change anyway, so there is no point in overdesigning it right now. Any solution for the sake of IfAPI cleanness is acceptable.
I'll add a couple more people to confirm or argue my statement.
Nov 16 2023
Nov 15 2023
- Tried to address @wulf 's feedback
Test signature as suggested by @imp
Nov 14 2023
Nov 11 2023
Nov 9 2023
It seems like this was obsoleted by:
Oct 26 2023
Wouldn't it be better to toggle the IFCAP2_BIT(IFCAP2_RXTLS4) and IFCAP2_BIT(IFCAP2_RXTLS6) than to add the check in mlx5e_tls_rx_snd_tag_alloc()?
Oct 12 2023
I'm abandoning this in favor of handling events directly (https://reviews.freebsd.org/D42158)
Oct 11 2023
In D42158#961862, @imp wrote:I don't suppose that there's a way to know if the GPE handler sleeps so we can warn / avoid it?
In D42158#961861, @andrew wrote:Would it be useful to add a tunable to revert to the current behaviour if we do find a machine that can't run the ACPI method from an ithread?
Update to add a tunable to run ged events in a deferred context, as suggested by @andrew
In D42141#961666, @jhb wrote:The duplicate events would be fixed by my other suggestion of using a dedicated struct task instead of calling AcpiOsExecute which allocates and schedules a new struct task each time.
My worry is that GED is too general a thing. It means "go run some random firmware-provided bit of AML that can do God knows what" when an interrupt occurs.
After digging in the spec for a bit, the description there for GED (5.6.9) is not very clear. One requirement for general interrupt event handling by the OS (OSPM) is to leave the interrupt source disabled (e.g. the GIC pin masked) until the ACPI control method has been executed (section 5.6.4, which talks about Generic Event Handling). The current acpi_ged driver definitely doesn't do that, and our interrupt model doesn't have a good way to cope with that (we re-enable the GIC pin after the ithread handler completes). We might just have to run the method synchronously and hope for the best. The spec doesn't mandate that these handlers are safe, but it does suggest that they should invoke Notify() from the AML for non-trivial event reporting, so it may be that these are safe. The _EVT handler is required by the spec to clear the interrupt so it doesn't keep firing.
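For reference, running _EVT synchronously from the interrupt handler might look roughly like this sketch; the softc layout and handler name are assumptions, and only the ACPI-CA call itself is real:

#include <contrib/dev/acpica/include/acpi.h>

/* Hypothetical, minimal softc for illustration. */
struct acpi_ged_softc {
        ACPI_HANDLE     sc_handle;      /* GED device handle */
        UINT64          sc_irq;         /* interrupt that fired */
};

static void
acpi_ged_intr(void *arg)
{
        struct acpi_ged_softc *sc = arg;
        ACPI_OBJECT_LIST args;
        ACPI_OBJECT obj;

        obj.Type = ACPI_TYPE_INTEGER;
        obj.Integer.Value = sc->sc_irq;
        args.Count = 1;
        args.Pointer = &obj;

        /* Run the AML _EVT handler before the interrupt source is unmasked. */
        (void)AcpiEvaluateObject(sc->sc_handle, "_EVT", &args, NULL);
}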
Oct 10 2023
In D42141#961590, @andrew wrote:Do we know what types of GED events might sleep? I'm not sure this will work when there's only a single CPU as it will be identical to the pre-AP startup case.
Oct 9 2023
Oct 5 2023
Sep 13 2023
Sep 6 2023
Thank you; this fixes the weird performance problem that we've yet to fully root cause.