Page MenuHomeFreeBSD

ipfw_nat: Perfomance of accessing multiple nat tables
Needs ReviewPublic

Authored by donner on Feb 8 2020, 9:27 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mar 15 2024, 7:36 PM
Unknown Object (File)
Feb 18 2024, 4:38 PM
Unknown Object (File)
Feb 18 2024, 4:20 PM
Unknown Object (File)
Dec 12 2023, 1:08 AM
Unknown Object (File)
Dec 9 2023, 6:56 AM
Unknown Object (File)
Nov 19 2023, 4:07 AM
Unknown Object (File)
Nov 17 2023, 2:27 AM
Unknown Object (File)
Nov 14 2023, 9:47 AM

Details

Reviewers
None
Group Reviewers
network
Summary

A scaling environment may require several NAT instances for the
amount of customers. This patch speed up the access of the NAT
instance in question while processing a packet in the ipfw rule.

It replaces the single unsorted list by a very simple hash of
lists. The hashing alogrithm works best with sequentially
numbered NAT instances.

Test Plan
# cat /boot/loader.conf  
net.inet.ip.fw.default_to_accept=1
net.inet.ip.fw.nat_hash=7
# kldload ipfw
# kldload ipfw_nat
# foreach i ( 1 2 3 4 5 6 7 8 9 10 11 12 )
> ipfw nat $i config ip 192.168.3.$i
> ipfw add nat $i ip from any to 192.168.3.$i
> ipfw add nat $i ip from 192.168.1.2 to 192.168.2.$i
> end
# kldunload ipfw_nat
# kldunload ipfw

Loading and unloading should work.
Natting should work regardless of the nat_hash value.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
No Lint Coverage
Unit
No Test Coverage
Build Status
Buildable 29306
Build 27209: arc lint + arc unit

Event Timeline

donner added a reviewer: network.
sys/netpfil/ipfw/ip_fw2.c
3373–3375

Module specific initialization moved into the module. Zeroized memory at initialization is sufficient to avoid erroneous behavior.

sys/netpfil/ipfw/ip_fw_nat.c
484–488

Avoid code duplication

We already have "nat tablearg" feature for this task:

single rule

ipfw add nat tablearg ip from any to 'table(1)'

for each $k and $i

ipfw table 1 add 192.168.$k.$i $k

Have you tried measuring performance of such configuration? It uses RADIX tree instead of linear seach.

bcr added a subscriber: bcr.

OK from manpages.

We already have "nat tablearg" feature for this task:

single rule

ipfw add nat tablearg ip from any to 'table(1)'

for each $k and $i

ipfw table 1 add 192.168.$k.$i $k

Have you tried measuring performance of such configuration? It uses RADIX tree instead of linear seach.

You missed the point: tablearg is for selecting the "number" of the nat instance given the rule processing.
This patch is about selecting the nat instance itself given the "number".

For some background about the amount of instances see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219316

donner retitled this revision from ipfw_nat: Perfomance of acceing multiple nat tables to ipfw_nat: Perfomance of accessing multiple nat tables.Feb 9 2020, 2:04 PM

We already have "nat tablearg" feature for this task:

The relevant lookup part in ip_fw2.c is

 t = ((ipfw_insn_nat *)cmd)->nat;
 if (t == NULL) {
    nat_id = TARG(cmd->arg1, nat);
    t = (*lookup_nat_ptr)(&chain->nat, nat_id);

    if (t == NULL) {
       retval = IP_FW_DENY;
       break;
    }
    if (cmd->arg1 != IP_FW_TARG)
        ((ipfw_insn_nat *)cmd)->nat = t;
}

The lookup for explicit referencing is cached (in cmd->nat), so the patch has no useful impact at all.
Using tableargs (what I want to do), the caching is disabled, and the lookup for the result of the tablearg is performed for each packet traversing the kernel. So the speedup of this patch is only realized, if the nat table is selected by a tablearg.

General comment: I'd prefer not to add non-resizable hashes. It should be system job, not user, to resize the hash. Unfortunately, there is no existing generic resizable hash primitive in the kernel code currently.

Speaking of this particular case, I would suggest doing it slightly differently.
We know that nat numbers are limited to 65k. Given that, we can simply allocate 65k array of pointers on the first addition of the nat rule, w/o bothering about hash efficiency, resizing, etc.

General comment: I'd prefer not to add non-resizable hashes. It should be system job, not user, to resize the hash. Unfortunately, there is no existing generic resizable hash primitive in the kernel code currently.

Speaking of this particular case, I would suggest doing it slightly differently.
We know that nat numbers are limited to 65k. Given that, we can simply allocate 65k array of pointers on the first addition of the nat rule, w/o bothering about hash efficiency, resizing, etc.

This approach has serious consequences:

  • It takes an non-trivial amount of kernel memory. I tend to avoid this.
  • It causes a significant performance hit in all cases where the full scan of all instances is required (i.e. dealiasing via "global" or "ipfw show"). This is the reason, why I tried to keep the hash table as small as possible.
  • Most use cases for the non-experienced user (or embedded devices) involves only a single NAT instance at all. I don't want to burden them with performance issues of large CGN providers.

Futhermore the benefit of the patch is only realized for tablearg selection of NAT instances. All other cases do access the NAT instances either by caching or by full scan.

In D23586#519077, @lutz_donnerhacke.de wrote:

General comment: I'd prefer not to add non-resizable hashes. It should be system job, not user, to resize the hash. Unfortunately, there is no existing generic resizable hash primitive in the kernel code currently.

Speaking of this particular case, I would suggest doing it slightly differently.
We know that nat numbers are limited to 65k. Given that, we can simply allocate 65k array of pointers on the first addition of the nat rule, w/o bothering about hash efficiency, resizing, etc.

This approach has serious consequences:

  • It takes an non-trivial amount of kernel memory. I tend to avoid this.

Well, we already allocate 2 such arrays for the rule index, so 512k won't drastically increase the footprint.
Anyway, I think even the dynamic-sized array capped by max instance number could be implemented relatively easily.

  • It causes a significant performance hit in all cases where the full scan of all instances is required (i.e. dealiasing via "global" or "ipfw show"). This is the reason, why I tried to keep the hash table as small as possible.

No. You still keep the list.

  • Most use cases for the non-experienced user (or embedded devices) involves only a single NAT instance at all. I don't want to burden them with performance issues of large CGN providers.

Not sure how this is applicable, as the array option does not have any configurable options.

Futhermore the benefit of the patch is only realized for tablearg selection of NAT instances. All other cases do access the NAT instances either by caching or by full scan.

Well, we already allocate 2 such arrays for the rule index, so 512k won't drastically increase the footprint.

Then we can get rid of the whole caching framework, which will decrease the code size considerably. Just lookup the nat instance directly (if the number is static or from tablearg), so more branches in this path. This way the idea become much more interesting.

  • It causes a significant performance hit in all cases where the full scan of all instances is required (i.e. dealiasing via "global" or "ipfw show"). This is the reason, why I tried to keep the hash table as small as possible.

No. You still keep the list.

Yep. Only for the purpose of scanning all instances.

Chance to a simple table based approach.
Remove the whole caching framework incl. special opcode extensions.
Keep the table small (dynamically allocated).
Move local managment structures from global into local file.
No need for externally visible tunables anymore, no man page changes.
Not yet fully tested (only module loading/unloading, rule creating, deletion).

This implementation certainly looks better, both eliminating old hacks and improving performance.
Some comments inline.

sys/netpfil/ipfw/ip_fw_nat.c
99

Is it nat_list or nat_mgmt or nat_ctl or nat_priv or ?

100

also: why all? Both list and array have all items.

Maybe it would be better to consider something like nat_list, nat_idxmap, nat_idxsize etc?
Also: can size be negative?

561

You cannot do M_WAITOK while holding non-sleepable lock.
Also, as this is configuration change, UH_WLOCK (configuration lock) should be sufficient for most of the control plane changes.

To be a bit more specific: nat list is not used in fast path anymore, hence there is no need to acquire IPFW_WLOCK for it, UH_WLOCK is sufficient.
For realloc: don't you want to add M_ZERO ?

Some generic comments: you almost never want to panic in case of memory allocation failure. Typically you either retry a couple of times or return an error to the caller.
Also, you almost never have the luxury of using M_WAITOK.

1169

Do alloc first and then assign to chain under both UH_WLOCK and IPFW_WLOCK.

Also, I'm not sure if we're allowed to sleep in vnet_init hooks. Probably yes, but better verify.

sys/netpfil/ipfw/ip_fw_sockopt.c
2006 ↗(On Diff #68243)

You have to keep backward compatibility. Old and new ipfw(8) binaries have to work.
I guess you can potentially accept both old & new opcode variations in the kernel, while changing ipfw(8) to use the new one, but this has to be verified.

sys/netpfil/ipfw/ip_fw_nat.c
100

Negative values is not only an issue with "size" but also with "id".
This is far more serious, I'm work on this.

sys/netpfil/ipfw/ip_fw_nat.c
561

M_ZERO is a really good idea. I just had a beautiful panic by configuring an intermediate (go backwards below the largest nat id) unassigend instance. Thank you for the comment. Reading it solved my issue as it happend.

Fix a lot of errors.
Renaming the struct and fields.
id numbers are not longer generic ints, but uint16_t.
Fix locking.
Document creation of new nat instances in a more readable way.
Alloc memory only if outside of any locks.

Updated to revision 358355.

Updated to revision 367040.

This patch causes a hanging kernel, if the ipfw ruleset is modified under stress.
To modify the ruleset, the system needs to be rebooted.

gbe added a subscriber: gbe.

Removing manpages since no man page change is involved.

ipfw_nat is loadable as separate module.
Why we are use M_IPFW memory type ?